-
An Overview and Recent Developments in the Analysis of Multistate Processes
Authors:
Malka Gorfine,
Richard J. Cook,
Per Kragh Andersen,
Terry M. Therneau,
Pierre Joly,
Hein Putter,
Maja Pohar Perme,
Michal Abrahamowicz
Abstract:
Multistate models offer a powerful framework for studying disease processes and can be used to formulate intensity-based and more descriptive marginal regression models. They also represent a natural foundation for the construction of joint models for disease processes and dynamic marker processes, as well as joint models incorporating random censoring and intermittent observation times. This article reviews the ways multistate models can be formed and fitted to life history data. Recent work on pseudo-values and the incorporation of random effects to model dependence on the process history and between-process heterogeneity are also discussed. The software available to facilitate such analyses is listed.
Submitted 10 February, 2025;
originally announced February 2025.
-
Heterogeneous Treatment Effect in Time-to-Event Outcomes: Harnessing Censored Data with Recursively Imputed Trees
Authors:
Tomer Meir,
Uri Shalit,
Malka Gorfine
Abstract:
Tailoring treatments to individual needs is a central goal in fields such as medicine. A key step toward this goal is estimating Heterogeneous Treatment Effects (HTE), the way treatments impact different subgroups. While crucial, HTE estimation is challenging with survival data, where time until an event (e.g., death) is key. Existing methods often assume complete observation, an assumption violated in survival data due to right-censoring, leading to bias and inefficiency. Cui et al. (2023) proposed a doubly-robust method for HTE estimation in survival data under no hidden confounders, combining a causal survival forest with an augmented inverse-censoring weighting estimator. However, we find it struggles under heavy censoring, which is common in rare-outcome problems such as amyotrophic lateral sclerosis (ALS). Moreover, most current methods cannot handle instrumental variables, which are a crucial tool in the causal inference arsenal. We introduce Multiple Imputation for Survival Treatment Response (MISTR), a novel, general, and non-parametric method for estimating HTE in survival data. MISTR uses recursively imputed survival trees to handle censoring without directly modeling the censoring mechanism. Through extensive simulations and analysis of two real-world datasets, the AIDS Clinical Trials Group Protocol 175 and the Illinois unemployment dataset, we show that MISTR outperforms prior methods under heavy censoring in the no-hidden-confounders setting, and extends to the instrumental variable setting. To our knowledge, MISTR is the first non-parametric approach for HTE estimation with unobserved confounders via instrumental variables.
Submitted 4 June, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Cost-Effectiveness Analysis for Disease Prevention -- A Case Study on Colorectal Cancer Screening
Authors:
Yi Xiong,
Kwun C G Chan,
Malka Gorfine,
Li Hsu
Abstract:
Cancer screening has been widely recognized as an effective strategy for preventing the disease. Despite its effectiveness, determining when to start screening is complicated, because starting too early increases the number of screenings over a lifetime, and thus costs, whereas starting too late may miss cancers that could have been prevented. Therefore, to make an informed recommendation on the age to start screening, it is necessary to conduct a cost-effectiveness analysis to assess the gain in life years relative to the cost of screenings. As more large-scale observational studies become accessible, there is growing interest in evaluating cost-effectiveness based on empirical evidence. In this paper, we propose a unified measure for evaluating cost-effectiveness and a causal analysis for the continuous intervention of screening initiation age, under multi-state modeling with semi-competing risks. Extensive simulation results show that the proposed estimators perform well in realistic scenarios. We perform a cost-effectiveness analysis of colorectal cancer screening, utilizing data from the large-scale Women's Health Initiative. Our analysis reveals that initiating screening at age 50 years yields the highest quality-adjusted life years with an acceptable incremental cost-effectiveness ratio compared to no screening, providing real-world evidence in support of the screening recommendation for colorectal cancer.
Submitted 4 September, 2024;
originally announced September 2024.
-
Confidence Intervals and Simultaneous Confidence Bands Based on Deep Learning
Authors:
Asaf Ben Arie,
Malka Gorfine
Abstract:
Deep learning models have significantly improved prediction accuracy in various fields, gaining recognition across numerous disciplines. Yet, an aspect of deep learning that remains insufficiently addressed is the assessment of prediction uncertainty. Producing reliable uncertainty estimators could be crucial in practical terms. For instance, predictions associated with a high degree of uncertainty could be sent for further evaluation. Recent works in uncertainty quantification of deep learning predictions, including Bayesian posterior credible intervals and frequentist confidence-interval estimation, have proven to yield either invalid or overly conservative intervals. Furthermore, there is currently no method for quantifying uncertainty that can accommodate deep neural networks for survival (time-to-event) data that involve right-censored outcomes. In this work, we provide a valid non-parametric bootstrap method that correctly disentangles data uncertainty from the noise inherent in the adopted optimization algorithm, ensuring that the resulting point-wise confidence intervals or the simultaneous confidence bands are accurate (i.e., valid and not overly conservative). The proposed ad-hoc method can be easily integrated into any deep neural network without interfering with the training process. The utility of the proposed approach is illustrated by constructing simultaneous confidence bands for survival curves derived from deep neural networks for survival data with right censoring.
Submitted 20 June, 2024;
originally announced June 2024.
-
Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions
Authors:
Tal Agassi,
Nir Keret,
Malka Gorfine
Abstract:
In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets.
Submitted 19 June, 2024;
originally announced June 2024.
-
Cumulative Incidence Function Estimation Based on Population-Based Biobank Data
Authors:
Malka Gorfine,
David M. Zucker,
Shoval Shoham
Abstract:
Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data is collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency, and (2) CIF estimation for ages before the lower limit, $c_L$.
Submitted 27 March, 2024;
originally announced March 2024.
-
Unveiling Challenges in Mendelian Randomization for Gene-Environment Interaction
Authors:
Malka Gorfine,
Conghui Qu,
Ulrike Peters,
Li Hsu
Abstract:
Many diseases and traits involve a complex interplay between genes and environment, generating significant interest in studying gene-environment interaction through observational data. However, lifestyle and environmental risk factors are often susceptible to unmeasured confounding, which may bias the assessment of the joint effect of gene and environment. Recently, Mendelian randomization (MR) has evolved into a versatile method for assessing causal relationships based on observational data to account for unmeasured confounders. This approach utilizes genetic variants as instrumental variables (IVs) and aims to offer a reliable statistical test and estimation of causal effects. MR has gained substantial popularity in recent years largely due to the success of large-scale genome-wide association studies in identifying genetic variants associated with lifestyle and environmental factors. Many methods have been developed for MR; however, little work has been done on evaluating gene-environment interaction. In this paper, we focus on two primary IV approaches: the 2-stage predictor substitution (2SPS) and the 2-stage residual inclusion (2SRI), and extend them to accommodate gene-environment interaction under both the linear and logistic regression models for continuous and binary outcomes, respectively. Extensive simulation and analytical derivations show that finding solutions in the linear regression model setting is relatively straightforward; however, the logistic regression model is significantly more complex and demands additional effort.
Submitted 21 September, 2023;
originally announced September 2023.
-
Unlocking Retrospective Prevalent Information in EHRs -- a Pairwise Pseudolikelihood Approach
Authors:
Nir Keret,
Malka Gorfine
Abstract:
Typically, electronic health record data are not collected towards a specific research question. Instead, they comprise numerous observations recruited at different ages, whose medical, environmental and oftentimes also genetic data are being collected. Some phenotypes, such as disease-onset ages, may be reported retrospectively if the event preceded recruitment, and such observations are termed "prevalent". The standard method to accommodate this "delayed entry" conditions on the entire history up to recruitment, hence the retrospective prevalent failure times are conditioned upon and cannot participate in estimating the disease-onset age distribution. An alternative approach conditions just on survival up to recruitment age, plus the recruitment age itself. This approach allows incorporating the prevalent information but brings about numerical and computational difficulties. In this work we develop consistent estimators of the coefficients in a regression model for the age-at-onset, while utilizing the prevalent data. Asymptotic results are provided, and simulations are conducted to showcase the substantial efficiency gain that may be obtained by the proposed approach. In particular, the method is highly useful in leveraging large-scale repositories for replicability analysis of genetic variants. Indeed, analysis of urinary bladder cancer data reveals that the proposed approach yields about twice as many replicated discoveries compared to the popular approach.
Submitted 3 September, 2023;
originally announced September 2023.
-
Discrete-time Competing-Risks Regression with or without Penalization
Authors:
Tomer Meir,
Malka Gorfine
Abstract:
Many studies employ the analysis of time-to-event data that incorporates competing risks and right censoring. Most methods and software packages are geared towards analyzing data that comes from a continuous failure time distribution. However, failure-time data may sometimes be discrete either because time is inherently discrete or due to imprecise measurement. This paper introduces a new estimation procedure for discrete-time survival analysis with competing events. The proposed approach offers a key advantage over existing procedures: it allows for straightforward integration and application of widely used regularized regression and feature-screening methods. We illustrate the benefits of our proposed approach by a comprehensive simulation study. Additionally, we showcase the utility of the proposed procedure by estimating a survival model for the length of stay of patients hospitalized in the intensive care unit, considering three competing events: discharge to home, transfer to another medical facility, and in-hospital death. A Python package, PyDTS, is available for applying the proposed method with additional features.
Submitted 5 February, 2025; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Shared Frailty Methods for Complex Survival Data: A Review of Recent Advances
Authors:
Malka Gorfine,
David M. Zucker
Abstract:
Dependent survival data arise in many contexts. One context is clustered survival data, where survival data are collected on clusters such as families or medical centers. Dependent survival data also arise when multiple survival times are recorded for each individual. Frailty models are one common approach to handle such data. In frailty models, the dependence is expressed in terms of a random effect, called the frailty. Frailty models have been used with both the Cox proportional hazards model and the accelerated failure time model. This paper reviews recent developments in the area of frailty models in a variety of settings. In each setting we provide a detailed model description, assumptions, available estimation methods, and R packages.
Submitted 11 May, 2022;
originally announced May 2022.
-
An Accelerated Failure Time Regression Model for Illness-Death Data: A Frailty Approach
Authors:
Lea Kats,
Malka Gorfine
Abstract:
This work presents a new model and estimation procedure for illness-death survival data where the hazard functions follow accelerated failure time (AFT) models. A shared frailty variate induces positive dependence among failure times of a subject for handling the unobserved dependency between the non-terminal and the terminal failure times given the observed covariates. A semi-parametric maximum likelihood estimation procedure is developed via a kernel-smoothing-aided EM algorithm, and variances are estimated by weighted bootstrap. The model is presented in the context of existing frailty-based illness-death models, emphasizing the contribution of the current work. The breast cancer data of the Rotterdam tumor bank are analyzed using the proposed and existing illness-death models. The results are contrasted and evaluated based on a new graphical goodness-of-fit procedure. Simulation results and data analysis demonstrate the practical utility of the shared frailty variate with the AFT regression model under the illness-death framework.
Submitted 8 May, 2022;
originally announced May 2022.
-
PyDTS: A Python Package for Discrete-Time Survival (Regularized) Regression with Competing Risks
Authors:
Tomer Meir,
Rom Gutman,
Malka Gorfine
Abstract:
Time-to-event analysis (survival analysis) is used when the response of interest is the time until a pre-specified event occurs. Time-to-event data are sometimes discrete either because time itself is discrete or due to grouping of failure times into intervals or rounding off measurements. In addition, the failure of an individual could be one of several distinct failure types, known as competing risks (events). Most methods and software packages for survival regression analysis assume that time is measured on a continuous scale. It is well-known that naively applying standard continuous-time models with discrete-time data may result in biased estimators of the discrete-time models. The Python package PyDTS, for simulating, estimating and evaluating semi-parametric competing-risks models for discrete-time survival data, is introduced. The package implements a fast procedure that enables including regularized regression methods, such as LASSO and elastic net, among others. A simulation study showcases the flexibility and accuracy of the package. The utility of the package is demonstrated by analysing the Medical Information Mart for Intensive Care (MIMIC)-IV dataset for prediction of hospitalization length of stay.
Submitted 27 June, 2023; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Revisiting the Cumulative Incidence Function With Competing Risks Data
Authors:
David M. Zucker,
Malka Gorfine
Abstract:
We consider estimation of the cumulative incidence function (CIF) in the competing risks Cox model. We study three methods. Methods 1 and 2 are existing methods while Method 3 is a newly-proposed method. Method 3 is constructed so that the sum of the CIFs across all event types at the last observed event time is guaranteed, assuming no ties, to be equal to 1. The performance of the methods is examined in a simulation study, and the methods are illustrated on a data example from the field of computer code comprehension. The newly-proposed Method 3 exhibits performance comparable to that of Methods 1 and 2 in terms of bias, variance, and confidence interval coverage rates. Thus, with our newly-proposed estimator, the advantage of having the end-of-study total CIF equal to 1 is achieved with no price to be paid in terms of performance.
Submitted 23 February, 2022;
originally announced February 2022.
-
Optimal Cox Regression Subsampling Procedure with Rare Events
Authors:
Nir Keret,
Malka Gorfine
Abstract:
Massive survival datasets are becoming increasingly prevalent with the development of the healthcare industry. Such datasets pose computational challenges unprecedented in traditional survival analysis use-cases. A popular way of coping with massive datasets is downsampling them to a more manageable size, such that the computational resources can be afforded by the researcher. Cox proportional hazards regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the settings of right censored and possibly left truncated data with rare events, such that the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts, by assigning optimal sampling probabilities to censored observations, and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate the finite sample performance of the estimators. We further apply our procedure to UK Biobank data on genetic and environmental risk factors for colorectal cancer.
Submitted 30 January, 2022; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Causal inference for semi-competing risks data
Authors:
Daniel Nevo,
Malka Gorfine
Abstract:
An emerging challenge for time-to-event data is studying semi-competing risks, namely when two event times are of interest: a non-terminal event time (e.g. age at disease diagnosis), and a terminal event time (e.g. age at death). The non-terminal event is observed only if it precedes the terminal event, which may occur before or after the non-terminal event. Studying treatment or intervention effects on the dual event times is complicated because for some units, the non-terminal event may occur under one treatment value but not under the other. Until recently, existing approaches (e.g., the survivor average causal effect) generally disregarded the time-to-event nature of both outcomes. More recent research focused on principal strata effects within time-varying populations under Bayesian approaches. In this paper, we propose alternative non time-varying estimands, based on a single stratification of the population. We present a novel assumption utilizing the time-to-event nature of the data, which is weaker than the often-invoked monotonicity assumption. We derive results on partial identifiability, suggest a sensitivity analysis approach, and give conditions under which full identification is possible. Finally, we present non-parametric and semi-parametric estimation methods for right-censored data.
Submitted 9 October, 2020;
originally announced October 2020.
-
Efficient Study Design with Multiple Measurement Instruments
Authors:
Michal Bitan,
Malka Gorfine,
Laura Rosen,
David M. Steinberg
Abstract:
Outcomes from studies assessing exposure often use multiple measurements. In previous work, using a model first proposed by Buonoccorsi (1991), we showed that combining direct (e.g. biomarkers) and indirect (e.g. self-report) measurements provides a more accurate picture of true exposure than estimates obtained when using a single type of measurement. In this article, we propose a valuable tool for efficient design of studies that include both direct and indirect measurements of a relevant outcome. Based on data from a pilot or preliminary study, the tool, which is available online as a Shiny app, can be used to compute: (1) the sample size required for a statistical power analysis, while optimizing the percent of participants who should provide direct measures of exposure (biomarkers) in addition to the indirect (self-report) measures provided by all participants; (2) the ideal number of replicates; and (3) the allocation of resources to intervention and control arms. In addition we show how to examine the sensitivity of results to underlying assumptions. We illustrate our analysis using studies of tobacco smoke exposure and nutrition. In these examples, a near-optimal allocation of the resources can be found even if the assumptions are not precise.
Submitted 29 September, 2020;
originally announced September 2020.
-
Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data
Authors:
Malka Gorfine,
Nir Keret,
Asaf Ben Arie,
David Zucker,
Li Hsu
Abstract:
The UK Biobank is a large-scale health resource comprising genetic, environmental and medical information on approximately 500,000 volunteer participants in the UK, recruited at ages 40--69 during the years 2006--2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to estimate in a semi-parametric fashion the effects of genetic and environmental risk factors on the hazard functions of various diseases, such as colorectal cancer. An illness-death model is adopted, which inherently is a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work we provide an estimation procedure for our new illness-death model that overcomes all the above challenges.
Submitted 27 May, 2019;
originally announced June 2019.
-
$K$-sample omnibus non-proportional hazards tests based on right-censored data
Authors:
Malka Gorfine,
Matan Schlesinger,
Li Hsu
Abstract:
This work presents novel and powerful tests for comparing non-proportional hazard functions, based on sample-space partitions. Right censoring introduces two major difficulties which make the existing sample-space partition tests for uncensored data non-applicable: (i) the actual event times of censored observations are unknown; and (ii) the standard permutation procedure is invalid in case the censoring distributions of the groups are unequal. We overcome these two obstacles, introduce invariant tests, and prove their consistency. Extensive simulations reveal that under non-proportional alternatives, the proposed tests are often of higher power compared with existing popular tests for non-proportional hazards. Efficient implementation of our tests is available in the R package KONPsurv, which can be freely downloaded from https://github.com/matan-schles/KONPsurv
Submitted 27 October, 2019; v1 submitted 17 January, 2019;
originally announced January 2019.
-
An improved fully nonparametric estimator of the marginal survival function based on case-control clustered data
Authors:
David M. Zucker,
Malka Gorfine
Abstract:
A case-control family study is a study where individuals with a disease of interest (case probands) and individuals without the disease (control probands) are randomly sampled from a well-defined population. Possibly right-censored age at onset and disease status are observed for both probands and their relatives. Correlation among the outcomes within a family is induced by factors such as inherited genetic susceptibility, shared environment, and common behavior patterns. For this setting, we present a nonparametric estimator of the marginal survival function, based on local linear estimation of conditional survival functions. Asymptotic theory for the estimator is provided, and simulation results are presented showing that the method performs well. The method is illustrated on data from a prostate cancer study.
Keywords: case-control; family study; multivariate survival; nonparametric estimator; local linear
Submitted 3 December, 2018;
originally announced December 2018.
-
General Semiparametric Shared Frailty Model Estimation and Simulation with frailtySurv
Authors:
John V. Monaco,
Malka Gorfine,
Li Hsu
Abstract:
The R package frailtySurv for simulating and fitting semi-parametric shared frailty models is introduced. Package frailtySurv implements semi-parametric consistent estimators for a variety of frailty distributions, including gamma, log-normal, inverse Gaussian and power variance function, and provides consistent estimators of the standard errors of the parameter estimators. The parameter estimators are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as hypothesis testing and confidence intervals, can be performed using the normal distribution. Extensive simulations demonstrate the flexibility and correct implementation of the estimator. Two case studies performed with publicly available datasets demonstrate the applicability of the package. In the Diabetic Retinopathy Study, the onset of blindness is clustered by patient, and in a large hard drive failure dataset, failure times are thought to be clustered by the hard drive manufacturer and model.
Submitted 5 September, 2018; v1 submitted 21 February, 2017;
originally announced February 2017.
-
Consistent distribution-free $K$-sample and independence tests for univariate random variables
Authors:
Ruth Heller,
Yair Heller,
Shachar Kaufman,
Barak Brill,
Malka Gorfine
Abstract:
A popular approach for testing if two univariate random variables are statistically independent consists of partitioning the sample space into bins, and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While for detecting simple relationships coarse partitions may be best, for detecting complex relationships a great gain in power can be achieved by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example.
Submitted 18 June, 2015; v1 submitted 24 October, 2014;
originally announced October 2014.
-
A Quantile Regression Model for Failure-Time Data with Time-Dependent Covariates
Authors:
Malka Gorfine,
Yair Goldberg,
Yaacov Ritov
Abstract:
Since survival data occur over time, important covariates that we wish to consider often change over time as well. Such covariates are referred to as time-dependent covariates. Quantile regression offers flexible modeling of survival data by allowing the covariates to vary with quantiles. This paper provides a novel quantile regression model accommodating time-dependent covariates, for analyzing survival data subject to right censoring. Our simple estimation technique assumes the existence of instrumental variables. In addition, we present a doubly-robust estimator in the sense of Robins and Rotnitzky (1992). The asymptotic properties of the estimators are rigorously studied. Finite-sample properties are demonstrated by a simulation study. The utility of the proposed methodology is demonstrated using the Stanford heart transplant dataset.
Submitted 30 April, 2014;
originally announced April 2014.
-
Consistent distribution-free tests of association between univariate random variables
Authors:
Ruth Heller,
Yair Heller,
Shachar Kaufman,
Malka Gorfine
Abstract:
We consider the problem of testing whether pairs of univariate random variables are associated. Few tests of independence exist that are consistent against all dependent alternatives and are distribution free. We propose novel tests that are consistent, distribution free, and have excellent power properties. The tests have simple form, and are surprisingly computationally efficient thanks to accompanying innovative algorithms we develop. Moreover, we show that one of the test statistics is a consistent estimator of the mutual information. We demonstrate the good power properties in simulations, and apply the tests to a microarray study where many pairs of genes are examined simultaneously for co-dependence.
Submitted 8 December, 2014; v1 submitted 7 August, 2013;
originally announced August 2013.
-
A consistent multivariate test of association based on ranks of distances
Authors:
Ruth Heller,
Yair Heller,
Malka Gorfine
Abstract:
We are concerned with the detection of associations between random vectors of any dimension. Few tests of independence exist that are consistent against all dependent alternatives. We propose a powerful test that is applicable in all dimensions and is consistent against all alternatives. The test has a simple form and is easy to implement. We demonstrate its good power properties in simulations and on examples.
Submitted 31 May, 2012; v1 submitted 17 January, 2012;
originally announced January 2012.