-
Empirical Evidence That There Is No Such Thing As A Validated Prediction Model
Authors:
Florian D. van Leeuwen,
Ewout W. Steyerberg,
David van Klaveren,
Ben Wessler,
David M. Kent,
Erik W. van Zwet
Abstract:
Background: External validations are essential to assess clinical prediction models (CPMs) before deployment. Apart from model misspecification, differences in patient population and other factors influence a model's AUC (c-statistic). We aimed to quantify variation in AUCs across external validation studies and adjust expectations of a model's performance in a new setting.
Methods: The Tufts-PACE CPM Registry contains CPMs for cardiovascular disease prognosis. We analyzed the AUCs of 469 CPMs with a total of 1,603 external validations. For each CPM, we performed a random effects meta-analysis to estimate the between-study standard deviation $τ$ among the AUCs. Since most of these meta-analyses have only a handful of validations, the resulting estimates of $τ$ are very poor. We therefore estimated a log-normal distribution of $τ$ across all CPMs and used this as an empirical prior. We compared this empirical Bayesian approach with frequentist meta-analyses using cross-validation.
Results: The 469 CPMs had a median of 2 external validations (IQR: [1-3]). The estimated distribution of $τ$ had a mean of 0.055 and a standard deviation of 0.015. If $τ$ = 0.05, the 95% prediction interval for the AUC in a new setting is at least +/- 0.1, regardless of the number of validations. Frequentist methods underestimate the uncertainty about the AUC in a new setting. Accounting for $τ$ in a Bayesian approach achieved near nominal coverage.
Conclusion: Due to large heterogeneity among the validated AUC values of a CPM, there is great irreducible uncertainty in predicting the AUC in a new setting. This uncertainty is underestimated by existing methods. The proposed empirical Bayes approach addresses this problem and merits wide application in judging the validity of prediction models.
Submitted 12 June, 2024;
originally announced June 2024.
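The claim that $τ$ = 0.05 implies a 95% prediction interval of at least roughly +/- 0.1 can be checked with a few lines of arithmetic. The sketch below is not the authors' code: it assumes a normal approximation in which the AUC in a new setting varies around the pooled AUC with variance $τ^2$ plus the squared standard error of the pooled estimate. The function name `prediction_interval` and the pooled AUC of 0.75 are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def prediction_interval(pooled_auc, se_pooled, tau, level=0.95):
    """Approximate prediction interval for the AUC in a new setting.

    Assumption (not from the abstract): normal approximation with
    total variance tau^2 + se_pooled^2.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(tau ** 2 + se_pooled ** 2)
    return pooled_auc - half_width, pooled_auc + half_width


# Even if the pooled AUC were known exactly (se_pooled = 0), which is the
# limit of infinitely many validations, tau alone sets a floor on the width.
low, high = prediction_interval(pooled_auc=0.75, se_pooled=0.0, tau=0.05)
floor_half_width = (high - low) / 2

# With a realistic, non-zero standard error the interval can only get wider.
low_se, high_se = prediction_interval(pooled_auc=0.75, se_pooled=0.02, tau=0.05)
wider_half_width = (high_se - low_se) / 2

print(round(floor_half_width, 3), round(wider_half_width, 3))
```

Under these assumptions the floor is about 0.098, which is where the "at least +/- 0.1, regardless of the number of validations" statement comes from: more validations shrink `se_pooled` but leave `tau` untouched.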
-
G-formula for causal inference via multiple imputation
Authors:
Jonathan W. Bartlett,
Camila Olarte Parra,
Emily Granger,
Ruth H. Keogh,
Erik W. van Zwet,
Rhian M. Daniel
Abstract:
G-formula is a popular approach for estimating treatment or exposure effects from longitudinal data that are subject to time-varying confounding. G-formula estimation is typically performed by Monte-Carlo simulation, with non-parametric bootstrapping used for inference. We show that G-formula can be implemented by exploiting existing methods for multiple imputation (MI) for synthetic data. This involves using an existing modified version of Rubin's variance estimator. In practice, missing data are ubiquitous in longitudinal datasets. We show that such missing data can be readily accommodated as part of the MI procedure when using G-formula, and describe how MI software can be used to implement the approach. We explore its performance using a simulation study and an application from cystic fibrosis.
Submitted 11 October, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Think before you shrink: Alternatives to default shrinkage methods can improve prediction accuracy, calibration and coverage
Authors:
Mark A. van de Wiel,
Gwenaël G. R. Leday,
Jeroen Hoogland,
Martijn W. Heymans,
Erik W. van Zwet,
Ailko H. Zwinderman
Abstract:
While shrinkage is essential in high-dimensional settings, its use for low-dimensional regression-based prediction has been debated. It reduces variance, often leading to improved prediction accuracy. However, it also inevitably introduces bias, which may harm two other measures of predictive performance: calibration and coverage of confidence intervals. Much of the criticism stems from the usage of standard shrinkage methods, such as lasso and ridge with a single, cross-validated penalty. Our aim is to show that readily available alternatives can strongly improve predictive performance, in terms of accuracy, calibration or coverage. For linear regression, we use small sample splits of a large, fairly typical epidemiological data set to illustrate this. We show that usage of differential ridge penalties for covariate groups may enhance prediction accuracy, while calibration and coverage benefit from additional shrinkage of the penalties. In the logistic setting, we apply an external simulation to demonstrate that local shrinkage improves calibration with respect to global shrinkage, while providing better prediction accuracy than other solutions, like Firth's correction. The benefits of the alternative shrinkage methods are easily accessible via example implementations using \texttt{mgcv} and \texttt{r-stan}, including the estimation of multiple penalties. A synthetic copy of the large data set is shared for reproducibility.
Submitted 24 January, 2023;
originally announced January 2023.
-
Benchmarking survival outcomes: A funnel plot for survival data
Authors:
Hein Putter,
Dirk-Jan Eikema,
Liesbeth C. de Wreede,
Eoin McGrath,
Isabel Sanchez-Ortega,
Riccardo Saccardi,
John A. Snowden,
Erik W. van Zwet
Abstract:
Benchmarking is commonly used in many healthcare settings to monitor clinical performance, with the aim of increasing cost-effectiveness and safe care of patients. The funnel plot is a popular tool in visualizing the performance of a healthcare center in relation to other centers and to a target, taking into account statistical uncertainty. In this paper we develop methodology for constructing funnel plots for survival data. The method takes into account censoring and can deal with differences in censoring distributions across centers. Practical issues in implementing the methodology are discussed, particularly in the setting of benchmarking clinical outcomes for hematopoietic stem cell transplantation. A simulation study is performed to assess the performance of the funnel plots under several scenarios. Our methodology is illustrated using data from the EBMT benchmarking project.
Submitted 26 April, 2021;
originally announced April 2021.
-
Simultaneous Confidence Intervals for Ranks With Application to Ranking Institutions
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet
Abstract:
When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals for the ranks of means based on an observed sample. For this aim, the only available method from the literature uses Monte-Carlo simulations and is highly anticonservative, especially when the means are close to each other or have ties. We present a novel method based on Tukey's honest significant difference test (HSD). Our new method is, on the contrary, conservative when there are no ties. When these two methods are properly rescaled to the nominal confidence level, they surprisingly perform very similarly. The Monte-Carlo method is, however, unscalable when the number of institutions is larger than 30 to 50, and thus stays anticonservative. We provide extensive simulations to support our claims, and the two methods are compared in terms of their simultaneous coverage and their efficiency. We provide a data analysis for 64 hospitals in the Netherlands and compare both methods. Software for our new methods is available online in package ICRanks downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.
Submitted 11 December, 2018;
originally announced December 2018.
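The idea of turning pairwise significance tests into confidence intervals for ranks can be sketched as follows. This is an illustrative reading of an HSD-style construction, not the ICRanks implementation: a center's lowest plausible rank is one plus the number of centers that are significantly smaller, and its highest plausible rank is the number of centers minus those that are significantly larger. The function name `rank_intervals` is hypothetical, and the critical value `crit` is assumed to be supplied by the caller (for a Tukey-Kramer test it would be the studentized range quantile divided by the square root of 2).

```python
from math import sqrt


def rank_intervals(estimates, std_errors, crit):
    """Rank intervals (rank 1 = smallest estimate) from pairwise comparisons.

    Two centers i and j are declared different when
    |estimate_i - estimate_j| > crit * sqrt(se_i^2 + se_j^2).
    `crit` is an assumed, caller-supplied critical value.
    """
    n = len(estimates)
    intervals = []
    for i in range(n):
        smaller = 0  # centers significantly below center i
        larger = 0   # centers significantly above center i
        for j in range(n):
            if j == i:
                continue
            diff = estimates[i] - estimates[j]
            threshold = crit * sqrt(std_errors[i] ** 2 + std_errors[j] ** 2)
            if diff > threshold:
                smaller += 1
            elif -diff > threshold:
                larger += 1
        intervals.append((1 + smaller, n - larger))
    return intervals


# Made-up data: two centers that cannot be separated and one clear outlier.
example = rank_intervals([1.0, 2.0, 10.0], [1.0, 1.0, 1.0], crit=3.0)
print(example)
```

In this made-up example the first two centers share the interval of ranks 1 to 2 because their difference is well within the noise, while the third center is pinned to rank 3. Centers that are close together or tied get wide intervals, which is the situation the abstract singles out.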
-
Adaptive Critical Value for Constrained Likelihood Ratio Testing
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet,
Eric A. Cator
Abstract:
We present a new way of testing ordered hypotheses against all alternatives which outperforms the classical approach in both simplicity and statistical power. Our new method tests the constrained likelihood ratio statistic against the quantile of a single chi-squared random variable with data-dependent degrees of freedom, instead of a mixture of chi-squares. Our new test is proved to have a valid finite-sample significance level $α$ and provides more power, especially for sparse alternatives (those with a few or a moderate number of null constraint violations), in comparison to the classical approach. Our method is also easier to use than the classical approach, which requires calculating or simulating a set of complicated weights. Two special cases are considered in more detail, namely the case of testing orthants $μ_1<0, \cdots, μ_n<0$ and the isotonic case of testing $μ_1<μ_2<μ_3$ against all alternatives. Contours of the difference in power are shown for these examples, illustrating the benefit of our new approach.
Submitted 25 June, 2018; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Simultaneous confidence sets for ranks using the partitioning principle - Technical report
Authors:
Diaa Al Mohamad,
Erik W. van Zwet,
Jelle J. Goeman,
Aldo Solari
Abstract:
Ranking institutions such as medical centers or universities is based on an indicator accompanied by an uncertainty measure such as a standard deviation, and confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals for the ranks of centers based on an observed sample. We present in this paper a novel method based on multiple testing which uses the partitioning principle and employs the likelihood ratio (LR) test on the partitions. The complexity of the algorithm is super-exponential. We present several ways and shortcuts to reduce this complexity. We also provide a polynomial algorithm which produces a very good bracketing for the multiple testing by linearizing the critical value of the LR test. We show that Tukey's Honest Significant Difference (HSD) test can be written as a partitioning procedure. The new methodology has promising properties in the sense that it opens the door, in a simple and easy way, to constructing new methods which may trade the exponential complexity for power of the test or vice versa. In comparison to Tukey's HSD test, the LR test seems to give better results when the centers are close to each other or the uncertainty in the data is high, which is confirmed in a simulation study.
Submitted 9 August, 2017;
originally announced August 2017.
-
An improvement of Tukey's HSD with application to ranking institutions
Authors:
Diaa Al Mohamad,
Jelle J. Goeman,
Erik W. van Zwet
Abstract:
When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals (CIs) for the ranks of centers based on an observed sample. We present a novel method based on Tukey's honest significant difference test (HSD) which is the first method to produce valid simultaneous CIs for ranks. Moreover, we introduce a new variant of Tukey's HSD based on the sequential rejection principle. The new algorithm ensures familywise error control, and produces simultaneous confidence intervals for the ranks uniformly shorter than those provided by Tukey's HSD for the same level of significance. We illustrate the method through both simulations and real data analysis from 64 hospitals in the Netherlands. Software for our new methods is available online in package \texttt{ICRanks} downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.
Submitted 22 November, 2018; v1 submitted 8 August, 2017;
originally announced August 2017.