-
A categorization of performance measures for estimated non-linear associations between an outcome and continuous predictors
Authors:
Theresa Ullmann,
Georg Heinze,
Michal Abrahamowicz,
Aris Perperoglou,
Willi Sauerbrei,
Matthias Schmid,
Daniela Dunkler,
for TG2 of the STRATOS initiative
Abstract:
In regression analysis, associations between continuous predictors and the outcome are often assumed to be linear. However, modeling the associations as non-linear can improve model fit. Many flexible modeling techniques, like (fractional) polynomials and spline-based approaches, are available. Such methods can be systematically compared in simulation studies, which require suitable performance measures to evaluate the accuracy of the estimated curves against the true data-generating functions. Although various measures have been proposed in the literature, no systematic overview exists so far. To fill this gap, we introduce a categorization of performance measures for evaluating estimated non-linear associations between an outcome and continuous predictors. This categorization includes many commonly used measures. The measures can not only be used in simulation studies, but also in application studies to compare different estimates to each other. We further illustrate and compare the behavior of different performance measures through some examples and a Shiny app.
Submitted 21 March, 2025;
originally announced March 2025.
-
Flexible tree-structured regression for clustered data with an application to quality of life in older adults
Authors:
Nikolai Spuck,
Matthias Schmid,
Moritz Berger
Abstract:
Tree-structured models are a powerful alternative to parametric regression models if non-linear effects and interactions are present in the data. Yet, classical tree-structured models might not be appropriate if data comes in clusters of units, which requires taking the dependence of observations into account. This is, for example, the case in cross-national studies, as presented here, where country-specific effects should not be neglected. To address this issue, we present a flexible tree-structured approach that achieves a sparse modeling of unit-specific effects and identifies subgroups (based on individual-level covariates) that differ with regard to the outcome. The methodological advances were motivated by the analysis of quality of life in older adults using data from the Survey of Health, Ageing and Retirement in Europe. Application of the proposed model yields promising results and illustrates the accessibility of the approach. A comparison to alternative methods with regard to variable selection and goodness-of-fit was performed in several simulation experiments.
Submitted 22 January, 2025;
originally announced January 2025.
-
Modeling the restricted mean survival time using pseudo-value random forests
Authors:
Alina Schenk,
Vanessa Basten,
Matthias Schmid
Abstract:
The restricted mean survival time (RMST) has become a popular measure to summarize event times in longitudinal studies. Defined as the area under the survival function up to a time horizon $τ$ > 0, the RMST can be interpreted as the life expectancy within the time interval [0, $τ$]. In addition to its straightforward interpretation, the RMST also allows for the definition of valid estimands for the causal analysis of treatment contrasts in medical studies. In this work, we introduce a non-parametric approach to model the RMST conditional on a set of baseline variables (including, e.g., treatment variables and confounders). Our method is based on a direct modeling strategy for the RMST, using leave-one-out jackknife pseudo-values within a random forest regression framework. In this way, it can be employed to obtain precise estimates of both patient-specific RMST values and confounder-adjusted treatment contrasts. Since our method (termed "pseudo-value random forest", PVRF) is model-free, RMST estimates are not affected by restrictive assumptions like the proportional hazards assumption. Particularly, PVRF offers a high flexibility in detecting relevant covariate effects from higher-dimensional data, thereby expanding the range of existing pseudo-value modeling techniques for RMST estimation. We investigate the properties of our method using simulations and illustrate its use by an application to data from the SUCCESS-A breast cancer trial. Our numerical experiments demonstrate that PVRF yields accurate estimates of both patient-specific RMST values and RMST-based treatment contrasts.
Submitted 2 November, 2024;
originally announced November 2024.
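The definition of the RMST quoted in the abstract above (the area under the survival function up to a time horizon τ) can be illustrated with a minimal sketch. It assumes a plain Kaplan-Meier estimate of the survival function and is not the PVRF method itself; the function names `kaplan_meier` and `rmst` are hypothetical and chosen for illustration only.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return the distinct event times and the Kaplan-Meier survival
    estimate that holds from each of those times onward."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)          # subjects still under observation
        d = np.sum((times == t) & events)     # events occurring exactly at t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    event_times, surv = kaplan_meier(times, events)
    keep = event_times < tau
    # Interval boundaries: 0, the event times before tau, and tau itself.
    grid = np.concatenate(([0.0], event_times[keep], [tau]))
    # Height of the step function on each interval (starts at 1).
    heights = np.concatenate(([1.0], surv[keep]))
    return float(np.sum(np.diff(grid) * heights))


# Four uncensored event times; without censoring the RMST equals the
# sample mean of min(T, tau).
print(rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=4.0))
```

Without censoring, the result coincides with the average of the event times truncated at τ, which is a quick sanity check for this kind of implementation.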
-
Achieving interpretable machine learning by functional decomposition of black-box models into explainable predictor effects
Authors:
David Köhler,
David Rügamer,
Matthias Schmid
Abstract:
Machine learning (ML) has seen significant growth in both popularity and importance. The high prediction accuracy of ML models is often achieved through complex black-box architectures that are difficult to interpret. This interpretability problem has been hindering the use of ML in fields like medicine, ecology and insurance, where an understanding of the inner workings of the model is paramount to ensure user acceptance and fairness. The need for interpretable ML models has boosted research in the field of interpretable machine learning (IML). Here we propose a novel approach for the functional decomposition of black-box predictions, which is considered a core concept of IML. The idea of our method is to replace the prediction function by a surrogate model consisting of simpler subfunctions. Similar to additive regression models, these functions provide insights into the direction and strength of the main feature contributions and their interactions. Our method is based on a novel concept termed stacked orthogonality, which ensures that the main effects capture as much functional behavior as possible and do not contain information explained by higher-order interactions. Unlike earlier functional IML approaches, it is neither affected by extrapolation nor by hidden feature interactions. To compute the subfunctions, we propose an algorithm based on neural additive modeling and an efficient post-hoc orthogonalization procedure.
Submitted 26 July, 2024;
originally announced July 2024.
-
Confidence intervals for tree-structured varying coefficients
Authors:
Nikolai Spuck,
Matthias Schmid,
Malte Monin,
Moritz Berger
Abstract:
The tree-structured varying coefficient model (TSVC) is a flexible regression approach that allows the effects of covariates to vary with the values of the effect modifiers. Relevant effect modifiers are identified inherently using recursive partitioning techniques. To quantify uncertainty in TSVC models, we propose a procedure to construct confidence intervals of the estimated partition-specific coefficients. This task constitutes a selective inference problem as the coefficients of a TSVC model result from data-driven model building. To account for this issue, we introduce a parametric bootstrap approach, which is tailored to the complex structure of TSVC. Finite sample properties, particularly coverage proportions, of the proposed confidence intervals are evaluated in a simulation study. For illustration, we consider applications to data from COVID-19 patients and from patients suffering from acute odontogenic infection. The proposed approach may also be adapted for constructing confidence intervals for other tree-based methods.
Submitted 28 June, 2024;
originally announced June 2024.
-
Modeling the Ratio of Correlated Biomarkers Using Copula Regression
Authors:
Moritz Berger,
Nadja Klein,
Michael Wagner,
Matthias Schmid
Abstract:
Modeling the ratio of two dependent components as a function of covariates is a frequently pursued objective in observational research. Despite the high relevance of this topic in medical studies, where biomarker ratios are often used as surrogate endpoints for specific diseases, existing models are based on oversimplified assumptions, assuming e.g. independence or strictly positive associations between the components. In this paper, we close this gap in the literature and propose a regression model where the marginal distributions of the two components are linked by a Frank copula. A key feature of our model is that it allows for both positive and negative correlations between the components, with one of the model parameters being directly interpretable in terms of Kendall's rank correlation coefficient. We study our method theoretically, evaluate finite sample properties in a simulation study and demonstrate its efficacy in an application to diagnosis of Alzheimer's disease via ratios of amyloid-beta and total tau protein biomarkers.
Submitted 1 December, 2023;
originally announced December 2023.
-
Detection of nonlinearity, discontinuity and interactions in generalized regression models
Authors:
Nikolai Spuck,
Matthias Schmid,
Moritz Berger
Abstract:
In generalized regression models the effect of continuous covariates is commonly assumed to be linear. This assumption, however, may be too restrictive in applications and may lead to biased effect estimates and decreased predictive ability. While a multitude of alternatives for the flexible modeling of continuous covariates have been proposed, methods that provide guidance for choosing a suitable functional form are still limited. To address this issue, we propose a detection algorithm that evaluates several approaches for modeling continuous covariates and guides practitioners to choose the most appropriate alternative. The algorithm utilizes a unified framework for tree-structured modeling which makes the results easily interpretable. We assessed the performance of the algorithm by conducting a simulation study. To illustrate the proposed algorithm, we analyzed data of patients suffering from chronic kidney disease.
Submitted 31 October, 2023;
originally announced October 2023.
-
Accounting for Time Dependency in Meta-Analyses of Concordance Probability Estimates
Authors:
Matthias Schmid,
Tim Friede,
Nadja Klein,
Leonie Weinhold
Abstract:
Recent years have seen the development of many novel scoring tools for disease prognosis and prediction. To become accepted for use in clinical applications, these tools have to be validated on external data. In practice, validation is often hampered by logistical issues, resulting in multiple small-sized validation studies. It is therefore necessary to synthesize the results of these studies using techniques for meta-analysis. Here we consider strategies for meta-analyzing the concordance probability for time-to-event data ("C-index"), which has become a popular tool to evaluate the discriminatory power of prediction models with a right-censored outcome. We show that standard meta-analysis of the C-index may lead to biased results, as the magnitude of the concordance probability depends on the length of the time interval used for evaluation (defined e.g. by the follow-up time, which might differ considerably between studies). To address this issue, we propose a set of methods for random-effects meta-regression that incorporate time directly as covariate in the model equation. In addition to analyzing nonlinear time trends via fractional polynomial, spline, and exponential decay models, we provide recommendations on suitable transformations of the C-index before meta-regression. Our results suggest that the C-index is best meta-analyzed using fractional polynomial meta-regression with logit-transformed C-index values. Classical random-effects meta-analysis (not considering time as covariate) is demonstrated to be a suitable alternative when follow-up times are small. Our findings have implications for the reporting of C-index values in future studies, which should include information on the length of the time interval underlying the calculations.
Submitted 3 December, 2022;
originally announced December 2022.
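The logit transformation of C-index values mentioned in the abstract above can be sketched as follows. This is a deliberately simplified illustration: it uses the delta method for the standard error on the logit scale and a fixed-effect inverse-variance pool, not the random-effects fractional polynomial meta-regression with time as covariate that the authors propose. All function names and the input values are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit_transform(c, se):
    """Map a C-index and its standard error to the logit scale.
    The standard error follows from the delta method: se / (c * (1 - c))."""
    return logit(c), se / (c * (1.0 - c))


def pool_fixed_effect(c_values, se_values):
    """Inverse-variance weighted mean on the logit scale, back-transformed
    to the C-index scale. Simplification: no between-study variance and
    no time covariate."""
    num, den = 0.0, 0.0
    for c, se in zip(c_values, se_values):
        y, s = logit_transform(c, se)
        w = 1.0 / s ** 2
        num += w * y
        den += w
    return expit(num / den)


# Made-up C-index values and standard errors from three validation studies.
print(pool_fixed_effect([0.68, 0.72, 0.75], [0.03, 0.02, 0.04]))
```

Working on the logit scale keeps the back-transformed pooled value inside the interval (0, 1), which is one practical motivation for transforming before pooling.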
-
Error-Covariance Analysis of Monocular Pose Estimation Using Total Least Squares
Authors:
Saeed Maleki,
John Crassidis,
Yang Cheng,
Matthias Schmid
Abstract:
This study presents a theoretical structure for the monocular pose estimation problem using total least squares. The unit-vector line-of-sight observations of the features are extracted from the monocular camera images. First, the optimization framework is formulated for the pose estimation problem with observation vectors extracted from unit vectors from the camera center-of-projection, pointing towards the image features. The attitude and position solutions obtained via the derived optimization framework are proven to reach the Cramér-Rao lower bound under the small-angle approximation of the attitude errors. Specifically, the Fisher information matrix and the Cramér-Rao bounds are evaluated and compared to the analytical derivations of the error-covariance expressions to rigorously prove the optimality of the estimates. The sensor data for the measurement model is provided through a series of vector observations, and two fully populated noise-covariance matrices are assumed for the body and reference observation data. The inverses of these matrices appear as a series of weight matrices in the cost function. The proposed solution is simulated in a Monte Carlo framework with 10,000 samples to validate the error-covariance analysis.
Submitted 20 October, 2022;
originally announced October 2022.
-
Optimal Pose Estimation and Covariance Analysis with Simultaneous Localization and Mapping Applications
Authors:
Saeed Maleki,
Adhiti Raman,
Yang Cheng,
John Crassidis,
Matthias Schmid
Abstract:
This work provides a theoretical analysis for optimally solving the pose estimation problem using total least squares for vector observations from landmark features, which is central to applications involving simultaneous localization and mapping. First, the optimization process is formulated with observation vectors extracted from point-cloud features. Then, error-covariance expressions are derived. The attitude and position estimates obtained via the derived optimization process are proven to reach the bounds defined by the Cramér-Rao lower bound under the small-angle approximation of attitude errors. A fully populated observation noise-covariance matrix is assumed as the weight in the cost function to cover the most general case of the sensor uncertainty. This includes more generic correlations in the errors than previous cases involving an isotropic noise assumption. The proposed solution is verified using Monte Carlo simulations and an experiment with an actual LIDAR to validate the error-covariance analysis.
Submitted 20 October, 2022;
originally announced October 2022.
-
Model-based recursive partitioning for discrete event times
Authors:
Cynthia Huber,
Matthias Schmid,
Tim Friede
Abstract:
Model-based recursive partitioning (MOB) is a semi-parametric statistical approach allowing the identification of subgroups that can be combined with a broad range of outcome measures including continuous time-to-event outcomes. When time is measured on a discrete scale, methods and models need to account for this discreteness, as otherwise subgroups might be spurious and effects biased. The test underlying the splitting criterion of MOB, the M-fluctuation test, assumes independent observations. However, for fitting discrete time-to-event models the data matrix has to be modified, resulting in an augmented data matrix that violates the independence assumption. We propose MOB for discrete Survival data (MOB-dS), which controls the type I error rate of the test used for data splitting and therefore the rate of identifying subgroups although none is present. MOB-dS uses a permutation approach accounting for dependencies in the augmented time-to-event data to obtain the distribution under the null hypothesis of no subgroups being present. Through simulations we investigate the type I error rate of the new MOB-dS and the standard MOB for different patterns of survival curves and event rates. We find that the type I error rate of the test is well controlled for MOB-dS, but observe considerable inflation of the error rate for MOB. To illustrate the proposed methods, MOB-dS is applied to data on unemployment duration.
Submitted 14 September, 2022;
originally announced September 2022.
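The augmented data matrix mentioned in the abstract above is commonly built in the person-period format, where an individual observed for t discrete periods contributes t rows with a binary response. The following minimal sketch assumes that standard format; it is not taken from the MOB-dS implementation, and the function name `augment` and the row layout are hypothetical.

```python
def augment(times, events):
    """Expand (time, event) pairs into person-period rows.

    Each row is (subject_id, period, y), where y = 1 only in the last
    observed period of a subject whose event was observed (event == 1).
    Censored subjects contribute rows with y = 0 throughout.
    """
    rows = []
    for subject_id, (t, d) in enumerate(zip(times, events)):
        for period in range(1, t + 1):
            y = 1 if (period == t and d == 1) else 0
            rows.append((subject_id, period, y))
    return rows


# Subject 0 has an event in period 2; subject 1 is censored in period 3.
for row in augment([2, 3], [1, 0]):
    print(row)
```

Because several rows stem from the same subject, the rows of the augmented matrix are not independent, which is the dependence the abstract refers to.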
-
Approximate exploitability: Learning a best response in large games
Authors:
Finbarr Timbers,
Nolan Bard,
Edward Lockhart,
Marc Lanctot,
Martin Schmid,
Neil Burch,
Julian Schrittwieser,
Thomas Hubert,
Michael Bowling
Abstract:
Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.
Submitted 3 November, 2022; v1 submitted 20 April, 2020;
originally announced April 2020.
-
Assessing the Calibration of Subdistribution Hazard Models in Discrete Time
Authors:
Moritz Berger,
Matthias Schmid
Abstract:
The generalization performance of a risk prediction model can be evaluated by its calibration, which measures the agreement between predicted and observed outcomes on external validation data. Here, methods for assessing the calibration of discrete time-to-event models in the presence of competing risks are proposed. The methods are designed for the class of discrete subdistribution hazard models, which directly relate the cumulative incidence function of one event of interest to a set of covariates. Simulation studies show that the methods are strong tools for calibration assessment even in scenarios with a high censoring rate and/or a large number of discrete time points. The proposed approaches are illustrated by an analysis of nosocomial pneumonia.
Submitted 30 January, 2020;
originally announced January 2020.
-
State-of-the-art in selection of variables and functional forms in multivariable analysis -- outstanding issues
Authors:
Willi Sauerbrei,
Aris Perperoglou,
Matthias Schmid,
Michal Abrahamowicz,
Heiko Becher,
Harald Binder,
Daniela Dunkler,
Frank E. Harrell Jr,
Patrick Royston,
Georg Heinze
Abstract:
How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc 'traditional' approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. Many outstanding issues in multivariable modelling remain to be resolved before a state of the art can be defined and evidence-supported guidance can be provided to researchers who have only a basic level of statistical knowledge. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics. We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables, and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling. Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research.
Submitted 1 July, 2019;
originally announced July 2019.
-
A Random Forest Approach for Modeling Bounded Outcomes
Authors:
Leonie Weinhold,
Matthias Schmid,
Marvin N. Wright,
Moritz Berger
Abstract:
Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of complex predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical random forest approaches may severely suffer as they do not account for the heteroscedasticity in the data. A random forest approach is proposed for relating beta distributed outcomes to explanatory variables. The approach explicitly makes use of the likelihood function of the beta distribution for the selection of splits during the tree-building procedure. In each iteration of the tree-building algorithm one chooses the combination of explanatory variable and splitting rule that maximizes the log-likelihood function of the beta distribution with the parameter estimates derived from the nodes of the currently built tree. Several simulation studies demonstrate the properties of the method and compare its performance to classical random forest approaches as well as to parametric regression models.
Submitted 18 January, 2019;
originally announced January 2019.
-
Correlation-Adjusted Regression Survival Scores for High-Dimensional Variable Selection
Authors:
Thomas Welchowski,
Verena Zuber,
Matthias Schmid
Abstract:
Background: The development of classification methods for personalized medicine is highly dependent on the identification of predictive genetic markers. In survival analysis it is often necessary to discriminate between influential and non-influential markers. Usually, the first step is to perform a univariate screening step that ranks the markers according to their associations with the outcome. It is common to perform screening using Cox scores, which quantify the associations between survival and each of the markers individually. Since Cox scores do not account for dependencies between the markers, their use is suboptimal in the presence of highly correlated markers. Methods: As an alternative to the Cox score, we propose the correlation-adjusted regression survival (CARS) score for right-censored survival outcomes. By removing the correlations between the markers, the CARS score quantifies the associations between the outcome and the set of "de-correlated" marker values. Estimation of the scores is based on inverse probability weighting, which is applied to log-transformed event times. For high-dimensional data, estimation is based on shrinkage techniques. Results: The consistency of the CARS score is proven under mild regularity conditions. In simulations, survival models based on CARS score rankings achieved higher areas under the precision-recall curve than competing methods. Two example applications on prostate and breast cancer confirmed these results. CARS scores are implemented in the R package carSurv. Conclusions: In research applications involving high-dimensional genetic data, the use of CARS scores for marker selection is a favorable alternative to Cox scores even when correlations between covariates are low. Having a straightforward interpretation and low computational requirements, CARS scores are an easy-to-use screening tool in personalized medicine research.
Submitted 24 February, 2018; v1 submitted 22 February, 2018;
originally announced February 2018.
-
Tree-Structured Modelling of Varying Coefficients
Authors:
Moritz Berger,
Gerhard Tutz,
Matthias Schmid
Abstract:
The varying-coefficient model is a strong tool for the modelling of interactions in generalized regression. It is easy to apply if both the variables that are modified as well as the effect modifiers are known. However, in general one has a set of explanatory variables and it is unknown which variables are modified by which covariates. A recursive partitioning strategy is proposed that is able to deal with the complex selection problem. The tree-structured modelling yields for each covariate, which is modified by other variables, a tree that visualizes the modified effects. The performance of the method is investigated in simulations and two applications illustrate its usefulness.
Submitted 24 May, 2017;
originally announced May 2017.
-
Semiparametric Regression for Discrete Time-to-Event Data
Authors:
Moritz Berger,
Matthias Schmid
Abstract:
Time-to-event models are a popular tool to analyse data where the outcome variable is the time to the occurrence of a specific event of interest. Here we focus on the analysis of time-to-event outcomes that are either intrinsically discrete or grouped versions of continuous event times. In the literature, there exists a variety of regression methods for such data. This tutorial provides an introduction to how these models can be applied using open source statistical software. In particular, we consider semiparametric extensions comprising the use of smooth nonlinear functions and tree-based methods. All methods are illustrated by data on the duration of unemployment of U.S. citizens.
Submitted 13 April, 2017;
originally announced April 2017.
-
An update on statistical boosting in biomedicine
Authors:
Andreas Mayr,
Benjamin Hofner,
Elisabeth Waldmann,
Tobias Hepp,
Olaf Gefeller,
Matthias Schmid
Abstract:
Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine-learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments in statistical boosting regarding variable selection, functional regression and advanced time-to-event modelling. Additionally, we provide a short overview of relevant applications of statistical boosting in biomedicine.
Submitted 27 February, 2017;
originally announced February 2017.
-
Stability selection for component-wise gradient boosting in multiple dimensions
Authors:
Janek Thomas,
Andreas Mayr,
Bernd Bischl,
Matthias Schmid,
Adam Smith,
Benjamin Hofner
Abstract:
We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that allows the incorporation of stability selection, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate (PFER). The model is fitted repeatedly to subsampled data and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new "noncyclical" fitting algorithm that incorporates an additional selection step of the best-fitting distribution parameter in each iteration. This new algorithm has the additional advantage that optimizing the tuning parameters of boosting is reduced from a multi-dimensional to a one-dimensional problem with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply this new algorithm to a study to estimate the abundance of common eider in Massachusetts, USA, featuring excess zeros, overdispersion, non-linearity and spatio-temporal structures. Eider abundance is estimated via boosted GAMLSS, allowing both mean and overdispersion to be regressed on covariates. Stability selection is used to obtain a sparse set of stable predictors.
Submitted 30 November, 2016;
originally announced November 2016.
-
Boosting Joint Models for Longitudinal and Time-to-Event Data
Authors:
Elisabeth Waldmann,
David Taylor-Robinson,
Nadja Klein,
Thomas Kneib,
Tania Pressler,
Matthias Schmid,
Andreas Mayr
Abstract:
Joint models for longitudinal and time-to-event data have gained a lot of attention in the last few years as they are a helpful technique to approach a common data structure in clinical studies where longitudinal outcomes are recorded alongside event times. Those two processes are often linked and the two outcomes should thus be modeled jointly in order to prevent the potential bias introduced by independent modelling. Commonly, joint models are estimated in likelihood-based expectation maximization or Bayesian approaches using frameworks where variable selection is problematic and which do not immediately work for high-dimensional data. In this paper, we propose a boosting algorithm tackling these challenges by being able to simultaneously estimate predictors for joint models and automatically select the most influential variables even in high-dimensional data situations. We analyse the performance of the new algorithm in a simulation study and apply it to the Danish cystic fibrosis registry which collects longitudinal lung function data on patients with cystic fibrosis together with data regarding the onset of pulmonary infections. This is the first approach to combine state-of-the-art algorithms from the field of machine-learning with the model class of joint models, providing a fully data-driven mechanism to select variables and predictor effects in a unified framework of boosting joint models.
Submitted 22 December, 2016; v1 submitted 9 September, 2016;
originally announced September 2016.
-
A Statistical Model for the Analysis of Beta Values in DNA Methylation Studies
Authors:
Leonie Weinhold,
Simone Wahl,
Matthias Schmid
Abstract:
Background: The analysis of DNA methylation is a key component in the development of personalized treatment approaches. A common way to measure DNA methylation is the calculation of beta values, which are bounded variables of the form M / (M + U) that are generated by Illumina's 450k BeadChip array. The statistical analysis of beta values is considered to be challenging, as traditional methods for the analysis of bounded variables, such as M-value regression and beta regression, are based on regularity assumptions that are often too strong to adequately describe the distribution of beta values. Results: We develop a statistical model for the analysis of beta values that is derived from a bivariate gamma distribution for the signal intensities M and U. By allowing for possible correlations between M and U, the proposed model explicitly takes into account the data-generating process underlying the calculation of beta values. Conclusion: The proposed model can be used to improve the identification of associations between beta values and covariates such as clinical variables and lifestyle factors in epigenome-wide association studies. It is as easy to apply to a sample of beta values as beta regression and M-value regression.
Submitted 24 July, 2016;
originally announced July 2016.
-
On the use of Harrell's C for clinical risk prediction via random survival forests
Authors:
Matthias Schmid,
Marvin Wright,
Andreas Ziegler
Abstract:
Random survival forests (RSF) are a powerful method for risk prediction of right-censored outcomes in biomedical research. RSF use the log-rank split criterion to form an ensemble of survival trees. The most common approach to evaluate the prediction accuracy of an RSF model is Harrell's concordance index for survival data ('C index'). Conceptually, this strategy implies that the split criterion in RSF is different from the evaluation criterion of interest. This discrepancy can be overcome by using Harrell's C for both node splitting and evaluation. We compare the difference between the two split criteria analytically and in simulation studies with respect to the preference of more unbalanced splits, termed end-cut preference (ECP). Specifically, we show that the log-rank statistic has a stronger ECP compared to the C index. In simulation studies and with the help of two medical data sets we demonstrate that the accuracy of RSF predictions, as measured by Harrell's C, can be improved if the log-rank statistic is replaced by the C index for node splitting. This is especially true in situations where the censoring rate or the fraction of informative continuous predictor variables is high. Conversely, log-rank splitting is preferable in noisy scenarios. Both C-based and log-rank splitting are implemented in the R package ranger. We recommend Harrell's C as split criterion for use in smaller scale clinical studies and the log-rank split criterion for use in large-scale 'omics' studies.
Submitted 18 July, 2016; v1 submitted 11 July, 2015;
originally announced July 2015.
-
gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework
Authors:
Benjamin Hofner,
Andreas Mayr,
Matthias Schmid
Abstract:
Generalized additive models for location, scale and shape (GAMLSS) are a flexible class of regression models that allow multiple parameters of a distribution function, such as the mean and the standard deviation, to be modeled simultaneously. With the R package gamboostLSS, we provide a boosting method to fit these models. Variable selection and model choice are naturally available within this regularized regression framework. To introduce and illustrate the R package gamboostLSS and its infrastructure, we use a data set on stunted growth in India. In addition to the specification and application of the model itself, we present a variety of convenience functions, including methods for tuning parameter selection, prediction and visualization of results. The package gamboostLSS is available from CRAN (http://cran.r-project.org/package=gamboostLSS).
Submitted 7 July, 2014;
originally announced July 2014.
-
Extending Statistical Boosting - An Overview of Recent Methodological Developments
Authors:
Andreas Mayr,
Harald Binder,
Olaf Gefeller,
Matthias Schmid
Abstract:
Boosting algorithms to simultaneously estimate and select predictor effects in statistical models have gained substantial interest during the last decade. This review article aims to highlight recent methodological developments regarding boosting algorithms for statistical modelling, especially focusing on topics relevant for biomedical research. We suggest a unified framework for gradient boosting and likelihood-based boosting (statistical boosting), which have been treated strictly separately in the literature up to now. Statistical boosting algorithms have been adapted to carry out unbiased variable selection and automated model choice during the fitting process and can nowadays be applied in almost any possible type of regression setting in combination with a large number of different types of predictor effects. The methodological developments on statistical boosting during the last ten years can be grouped into three different lines of research: (i) efforts to ensure variable selection leading to sparser models, (ii) developments regarding different types of predictor effects and their selection (model choice), (iii) approaches to extend the statistical boosting framework to new regression settings.
Submitted 18 November, 2014; v1 submitted 7 March, 2014;
originally announced March 2014.
-
The Evolution of Boosting Algorithms - From Machine Learning to Statistical Modelling
Authors:
Andreas Mayr,
Harald Binder,
Olaf Gefeller,
Matthias Schmid
Abstract:
The concept of boosting emerged from the field of machine learning. The basic idea is to boost the accuracy of a weak classifying tool by combining various instances into a more accurate prediction. This general concept was later adapted to the field of statistical modelling. This review article attempts to highlight this evolution of boosting algorithms from machine learning to statistical modelling. We describe the AdaBoost algorithm for classification as well as the two most prominent statistical boosting approaches, gradient boosting and likelihood-based boosting. Although both approaches are typically treated separately in the literature, they share the same methodological roots and follow the same fundamental concepts. Compared to the initial machine learning algorithms, which must be seen as black-box prediction schemes, statistical boosting results in statistical models which offer a straightforward interpretation. We highlight the methodological background and present the most common software implementations. Worked-out examples and corresponding R code can be found in the Appendix.
Submitted 18 November, 2014; v1 submitted 6 March, 2014;
originally announced March 2014.
-
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
Authors:
Andreas Mayr,
Matthias Schmid
Abstract:
The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are only suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a component-wise boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.
Submitted 25 October, 2013; v1 submitted 24 July, 2013;
originally announced July 2013.