-
Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach
Authors:
Emanuele Massa,
Anthony Coolen
Abstract:
We study the asymptotic behaviour of the Regularized Maximum Partial Likelihood Estimator (RMPLE) in the proportional limit, considering an arbitrary convex regularizer and assuming that the covariates $\mathbf{X}_i\in\mathbb{R}^{p}$ follow a multivariate Gaussian law with covariance $\mathbf{I}_p/p$ for each $i=1, \dots, n$. In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, which we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates are large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory \emph{solely from the data}, \emph{without} actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.
Submitted 6 February, 2025; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Replica analysis of overfitting in regression models for time to event data: the impact of censoring
Authors:
Emanuele Massa,
Alexander Mozeika,
Anthony Coolen
Abstract:
We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) RS equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and for the cumulative hazard.
Submitted 5 December, 2023;
originally announced December 2023.
-
Penalization-induced shrinking without rotation in high dimensional GLM regression: a cavity analysis
Authors:
Emanuele Massa,
Marianne Jonker,
Anthony Coolen
Abstract:
In high dimensional regression, where the number of covariates is of the order of the number of observations, ridge penalization is often used as a remedy against overfitting. Unfortunately, for correlated covariates such regularisation typically induces in generalized linear models not only shrinking of the estimated parameter vector, but also an unwanted \emph{rotation} relative to the true vector. We show analytically how this problem can be removed by using a generalization of ridge penalization, and we analyse the asymptotic properties of the corresponding estimators in the high dimensional regime, using the cavity method. Our results also provide a quantitative rationale for tuning the parameter that controls the amount of shrinking. We compare our theoretical predictions with simulated data and find excellent agreement.
Submitted 9 September, 2022;
originally announced September 2022.
-
Correction of overfitting bias in regression models
Authors:
Emanuele Massa,
Marianne Jonker,
Kit Roes,
Anthony Coolen
Abstract:
Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood regression becomes unreliable due to overfitting. This typically leads to systematic estimation biases and increased estimator variances. It is crucial for inference and prediction to quantify these effects correctly. Several methods have been proposed in the literature to overcome overfitting bias or adjust estimates. The vast majority of these focus on the regression parameters. But failure to also estimate the nuisance parameters correctly may lead to significant errors in confidence statements and outcome prediction.
In this paper we present a jackknife method for deriving a compact set of non-linear equations which describe the statistical properties of the ML estimator in the regime where $p=O(n)$ and under the hypothesis of normally distributed covariates. These equations enable one to compute the overfitting bias of maximum likelihood (ML) estimators in parametric regression models as functions of $ζ = p/n$. We then use these equations to compute shrinkage factors in order to remove the overfitting bias of ML estimators. This new derivation offers various benefits over the replica approach in terms of increased transparency and reduced assumptions. To illustrate the theory we performed simulation studies for multiple regression models. In all cases we find excellent agreement between theory and simulations.
Submitted 4 September, 2023; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Exact results on high-dimensional linear regression via statistical physics
Authors:
Alexander Mozeika,
Mansoor Sheikh,
Fabian Aguirre-Lopez,
Fabrizio Antenucci,
Anthony CC Coolen
Abstract:
It is clear that conventional statistical inference protocols need to be revised to deal correctly with the high-dimensional data that are now common. Most recent studies aimed at achieving this revision rely on powerful approximation techniques, that call for rigorous results against which they can be tested. In this context, the simplest case of high-dimensional linear regression has acquired significant new relevance and attention. In this paper we use the statistical physics perspective on inference to derive a number of new exact results for linear regression in the high-dimensional regime.
Submitted 6 April, 2021; v1 submitted 28 September, 2020;
originally announced September 2020.
-
A monotonicity property of weighted log-rank tests
Authors:
Tahani Coolen-Maturi,
Frank P. A. Coolen
Abstract:
The logrank test, also known as the Mantel-Haenszel test, is a well-known nonparametric test which is often used to compare the survival distributions of two samples including right censored observations. The $G^ρ$ family of tests, introduced by Harrington and Fleming (1982), generalizes the logrank test by using weights assigned to observations. In this paper, we present a monotonicity property for the $G^ρ$ family of tests, which was motivated by the need to derive bounds for the test statistic in case of imprecise data observations.
Submitted 3 August, 2020;
originally announced August 2020.
-
The joint survival signature of coherent systems with shared components
Authors:
Tahani Coolen-Maturi,
Frank P. A. Coolen,
Narayanaswamy Balakrishnan
Abstract:
The concept of joint bivariate signature, introduced by Navarro et al. (2013), is a useful tool for quantifying the reliability of two systems with shared components. As with the univariate system signature, introduced by Samaniego (2007), its applications are limited to systems with only one type of components, which restricts its practical use. Coolen and Coolen-Maturi (2012) introduced the survival signature, which generalizes Samaniego's signature and can be used for systems with multiple types of components. This paper introduces a joint survival signature for multiple systems with multiple types of components and with some components shared between systems. A particularly important feature is that the functioning of these systems can be considered at different times, enabling computation of relevant conditional probabilities with regard to a system's functioning conditional on the status of another system with which it shares components. Several opportunities for practical application and related challenges for further development of the presented concept are briefly discussed, setting out an important direction for future research.
Submitted 11 August, 2020; v1 submitted 4 July, 2020;
originally announced July 2020.
-
Replica analysis of overfitting in generalized linear models
Authors:
ACC Coolen,
M Sheikh,
A Mozeika,
F Aguirre-Lopez,
F Antenucci
Abstract:
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. This limitation has for many disciplines with increasingly high-dimensional data become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and how with the replica method from statistical physics one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLM), with possibly correlated covariates. We analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of $L2$ priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the $L2$ regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by application to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime $p=O(N)$.
Submitted 8 July, 2020; v1 submitted 14 April, 2020;
originally announced April 2020.
-
Analysis of overfitting in the regularized Cox model
Authors:
M Sheikh,
A. C. C. Coolen
Abstract:
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and inferred regression parameters in regularized multivariate Cox regression with L2 regularization, in the regime where both $p$ and $N$ are large but with $p/N \sim O(1)$. We thereby generalize a recent study from maximum likelihood to maximum a posteriori inference. We also establish a relationship between the optimal regularization parameter and $p/N$, allowing for straightforward overfitting corrections in time-to-event analysis.
Submitted 25 July, 2019; v1 submitted 14 April, 2019;
originally announced April 2019.
-
Accurate Bayesian Data Classification without Hyperparameter Cross-validation
Authors:
M Sheikh,
A C C Coolen
Abstract:
We extend the standard Bayesian multivariate Gaussian generative data classifier by considering a generalization of the conjugate, normal-Wishart prior distribution and by deriving the hyperparameters analytically via evidence maximization. The behaviour of the optimal hyperparameters is explored in the high-dimensional data regime. The classification accuracy of the resulting generalized model is competitive with state-of-the-art Bayesian discriminant analysis methods, but without the usual computational burden of cross-validation.
Submitted 28 December, 2017;
originally announced December 2017.
-
Covariate dimension reduction for survival data via the Gaussian process latent variable model
Authors:
James E. Barrett,
Anthony C. C. Coolen
Abstract:
The analysis of high dimensional survival data is challenging, primarily due to the problem of overfitting which occurs when spurious relationships are inferred from data that subsequently fail to exist in test data. Here we propose a novel method of extracting a low dimensional representation of covariates in survival data by combining the popular Gaussian Process Latent Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The combined model offers a flexible non-linear probabilistic method of detecting and extracting any intrinsic low dimensional structure from high dimensional data. By reducing the covariate dimension we aim to diminish the risk of overfitting and increase the robustness and accuracy with which we infer relationships between covariates and survival outcomes. In addition, we can simultaneously combine information from multiple data sources by expressing multiple datasets in terms of the same low dimensional space. We present results from several simulation studies that illustrate a reduction in overfitting and an increase in predictive performance, as well as successful detection of intrinsic dimensionality. We provide evidence that it is advantageous to combine dimensionality reduction with survival outcomes rather than performing unsupervised dimensionality reduction on its own. Finally, we use our model to analyse experimental gene expression data and detect and extract a low dimensional representation that allows us to distinguish high and low risk groups with superior accuracy compared to doing regression on the original high dimensional data.
Submitted 27 January, 2016; v1 submitted 3 June, 2014;
originally announced June 2014.
-
Gaussian process regression for survival data with competing risks
Authors:
James E. Barrett,
Anthony C. C. Coolen
Abstract:
We apply Gaussian process (GP) regression, which provides a powerful non-parametric probabilistic method of relating inputs to outputs, to survival data consisting of time-to-event and covariate measurements. In this context, the covariates are regarded as the `inputs' and the event times are the `outputs'. This allows for highly flexible inference of non-linear relationships between covariates and event times. Many existing methods, such as the ubiquitous Cox proportional hazards model, focus primarily on the hazard rate which is typically assumed to take some parametric or semi-parametric form. Our proposed model belongs to the class of accelerated failure time models where we focus on directly characterising the relationship between covariates and event times without any explicit assumptions on what form the hazard rates take. It is straightforward to include various types and combinations of censored and truncated observations. We apply our approach to both simulated and experimental data. We then apply multiple output GP regression, which can handle multiple potentially correlated outputs for each input, to competing risks survival data where multiple event types can occur. By tuning one of the model parameters we can control the extent to which the multiple outputs (the time-to-event for each risk) are dependent thus allowing the specification of correlated risks. Simulation studies suggest that in some cases assuming dependence can lead to more accurate predictions.
Submitted 5 September, 2014; v1 submitted 5 December, 2013;
originally announced December 2013.