-
Linear Regression Using Hilbert-Space-Valued Covariates with Unknown Reproducing Kernel
Authors:
Xinyi Li,
Margaret Hoch,
Michael R. Kosorok
Abstract:
We present a new method of linear regression based on principal components using Hilbert-space-valued covariates with unknown reproducing kernels. We develop a computationally efficient approach to estimation and derive asymptotic theory for the regression parameter estimates under mild assumptions. We demonstrate the approach in simulation studies as well as in data analysis using two-dimensional brain images as predictors.
Submitted 23 April, 2025;
originally announced April 2025.
-
Uncertainty quantification for intervals
Authors:
Carlos García Meixide,
Michael R. Kosorok,
Marcos Matabuena
Abstract:
Data following an interval structure are increasingly prevalent in many scientific applications. In medicine, clinical events are often monitored between two clinical visits, making the exact time of the event unknown and generating outcomes with a range format. As interest in automating healthcare decisions grows, uncertainty quantification via predictive regions becomes essential for developing reliable and trustworthy predictive algorithms. However, the statistical literature currently lacks a general methodology for interval targets, especially when these outcomes are incomplete due to censoring. We propose an uncertainty quantification algorithm for interval responses and establish its theoretical properties using empirical process arguments based on a newly developed class of functions specifically designed for these interval data structures. Although this paper primarily focuses on deriving predictive regions for interval-censored data, the approach can also be applied to other statistical modeling tasks, such as goodness-of-fit assessments. Finally, the applicability of the method is demonstrated through simulations, showing up to a 60% improvement in conditional coverage. Our new algorithm is also applied to various biomedical contexts, including two clinical examples: i) sleep duration and its association with cardiovascular diseases, and ii) survival time in relation to physical activity levels.
Submitted 30 March, 2025; v1 submitted 29 August, 2024;
originally announced August 2024.
-
Inference for change-plane regression
Authors:
Chaeryon Kang,
Hunyong Cho,
Rui Song,
Moulinath Banerjee,
Eric B. Laber,
Michael R. Kosorok
Abstract:
A key challenge in analyzing the behavior of change-plane estimators is that the objective function has multiple minimizers. Two estimators are proposed to deal with this non-uniqueness. For each estimator, an n-rate of convergence is established, and the limiting distribution is derived. Based on these results, we provide a parametric bootstrap procedure for inference. The validity of our theoretical results and the finite sample performance of the bootstrap are demonstrated through simulation experiments. We apply the proposed methods to latent subgroup identification in precision medicine using the ACTG175 AIDS study data.
Submitted 13 January, 2024; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Discussion of Multiscale Fisher's Independence Test for Multivariate Dependence
Authors:
Duyeol Lee,
Helal El-Zaatari,
Michael R. Kosorok,
Xinyi Li,
Kai Zhang
Abstract:
The multiscale Fisher's independence test (MULTIFIT hereafter) proposed by Gorsky & Ma (2022) is a novel method to test independence between two random vectors. By its design, this test is particularly useful in detecting local dependence. Moreover, by adopting a resampling-free approach, it can easily accommodate massive sample sizes. Another benefit of the proposed method is its ability to interpret the nature of dependency. We congratulate the authors, Shai Gorsky and Li Ma, for their very interesting and elegant work. In this comment, we would like to discuss a general framework unifying the MULTIFIT and other tests and compare it with the binary expansion randomized ensemble test (BERET hereafter) proposed by Lee et al. (In press). We also would like to contribute our thoughts on potential extensions of the method.
Submitted 26 April, 2022;
originally announced April 2022.
-
Kernel Assisted Learning for Personalized Dose Finding
Authors:
Liangyu Zhu,
Wenbin Lu,
Michael R. Kosorok,
Rui Song
Abstract:
An individualized dose rule recommends a dose level within a continuous safe dose range based on patient-level information such as physical conditions, genetic factors and medication histories. Traditionally, the personalized dose-finding process requires repeated clinical visits by the patient and frequent adjustments of the dosage. Thus the patient is constantly exposed to the risk of underdosing and overdosing during the process. Statistical methods for finding an optimal individualized dose rule can lower the costs and risks for patients. In this article, we propose a kernel assisted learning method for estimating the optimal individualized dose rule. The proposed methodology can also be applied to all other continuous decision-making problems. Advantages of the proposed method include robustness to model misspecification and the capability of providing statistical inference for the estimated parameters. In the simulation studies, we show that this method is capable of identifying the optimal individualized dose rule and produces favorable expected outcomes in the population. Finally, we illustrate our approach using data from a warfarin dosing study for thrombosis patients.
Submitted 19 July, 2020;
originally announced July 2020.
-
The Binary Expansion Randomized Ensemble Test (BERET)
Authors:
Duyeol Lee,
Kai Zhang,
Michael R. Kosorok
Abstract:
Recently, the binary expansion testing framework was introduced to test the independence of two continuous random variables by utilizing symmetry statistics that are complete sufficient statistics for dependence. We develop a new test based on an ensemble approach that uses the sum of squared symmetry statistics and distance correlation. Simulation studies suggest that this method improves the power while preserving the clear interpretation of the binary expansion testing. We extend this method to tests of independence of random vectors in arbitrary dimension. Through random projections, the proposed binary expansion randomized ensemble test transforms the multivariate independence testing problem into a univariate problem. Simulation studies and data example analyses show that the proposed method provides relatively robust performance compared with existing methods.
Submitted 7 January, 2021; v1 submitted 8 December, 2019;
originally announced December 2019.
-
Efficient estimation of accelerated lifetime models under length-biased sampling
Authors:
Pourab Roy,
Jason P. Fine,
Michael R. Kosorok
Abstract:
In prevalent cohort studies where subjects are recruited at a cross-section, the time to an event may be subject to length-biased sampling, with the observed data being either the forward recurrence time, or the backward recurrence time, or their sum. In the regression setting, it has been shown that the accelerated failure time model for the underlying event time is invariant under these observed data set-ups and can be fitted using standard methodology for accelerated failure time model estimation, ignoring the length-bias. However, the efficiency of these estimators is unclear, owing to the fact that the observed covariate distribution, which is also length-biased, may contain information about the regression parameter in the accelerated life model. We demonstrate that if the true covariate distribution is completely unspecified, then the naive estimator based on the conditional likelihood given the covariates is fully efficient.
Submitted 4 April, 2019;
originally announced April 2019.
-
Asymptotics for change-point models under varying degrees of mis-specification
Authors:
Rui Song,
Moulinath Banerjee,
Michael R. Kosorok
Abstract:
Change-point models are widely used by statisticians to model drastic changes in the pattern of observed data. Least squares/maximum likelihood based estimation of change-points leads to curious asymptotic phenomena. When the change-point model is correctly specified, such estimates generally converge at a fast rate ($n$) and are asymptotically described by minimizers of a jump process. Under complete mis-specification by a smooth curve, i.e. when a change-point model is fitted to data described by a smooth curve, the rate of convergence slows down to $n^{1/3}$ and the limit distribution changes to that of the minimizer of a continuous Gaussian process. In this paper we provide a bridge between these two extreme scenarios by studying the limit behavior of change-point estimates under varying degrees of model mis-specification by smooth curves, which can be viewed as local alternatives. We find that the limiting regime depends on how quickly the alternatives approach a change-point model. We unravel a family of 'intermediate' limits that can transition, at least qualitatively, to the limits in the two extreme scenarios.
Submitted 18 October, 2015; v1 submitted 2 September, 2014;
originally announced September 2014.
-
Q-learning with censored data
Authors:
Yair Goldberg,
Michael R. Kosorok
Abstract:
We develop methodology for a multistage decision problem with a flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with a flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications in the design of personalized medicine trials in cancer and in other life-threatening diseases.
Submitted 30 May, 2012;
originally announced May 2012.
-
Likelihood based inference for current status data on a grid: A boundary phenomenon and an adaptive inference procedure
Authors:
Runlong Tang,
Moulinath Banerjee,
Michael R. Kosorok
Abstract:
In this paper, we study the nonparametric maximum likelihood estimator for an event time distribution function at a point in the current status model with observation times supported on a grid of potentially unknown sparsity and with multiple subjects sharing the same observation time. This is of interest since observation time ties occur frequently with current status data. The grid resolution is specified as $cn^{-γ}$ with $c>0$ being a scaling constant and $γ>0$ regulating the sparsity of the grid relative to $n$, the number of subjects. The asymptotic behavior falls into three cases depending on $γ$: regular Gaussian-type asymptotics obtain for $γ<1/3$, nonstandard cube-root asymptotics prevail when $γ>1/3$ and $γ=1/3$ serves as a boundary at which the transition happens. The limit distribution at the boundary is different from either of the previous cases and converges weakly to those obtained with $γ\in(0,1/3)$ and $γ\in(1/3,\infty)$ as $c$ goes to $\infty$ and 0, respectively. This weak convergence allows us to develop an adaptive procedure to construct confidence intervals for the value of the event time distribution at a point of interest without needing to know or estimate $γ$, which is of enormous advantage from the perspective of inference. A simulation study of the adaptive procedure is presented.
Submitted 28 May, 2012;
originally announced May 2012.
-
Support Vector Regression for Right Censored Data
Authors:
Yair Goldberg,
Michael R. Kosorok
Abstract:
We develop a unified approach for classification and regression support vector machines for data subject to right censoring. We provide finite sample bounds on the generalization error of the algorithm, prove risk consistency for a wide class of probability measures, and study the associated learning rates. We apply the general methodology to estimation of the (truncated) mean, median, and quantiles, and to classification problems. We present a simulation study that demonstrates the performance of the proposed approach.
Submitted 12 January, 2013; v1 submitted 23 February, 2012;
originally announced February 2012.
-
Simultaneous critical values for $t$-tests in very high dimensions
Authors:
Hongyuan Cao,
Michael R. Kosorok
Abstract:
This article considers the problem of multiple hypothesis testing using $t$-tests. The observed data are assumed to be independently generated conditional on an underlying and unknown two-state hidden model. We propose an asymptotically valid data-driven procedure to find critical values for rejection regions controlling the $k$-familywise error rate ($k$-FWER), false discovery rate (FDR) and the tail probability of false discovery proportion (FDTP) by using one-sample and two-sample $t$-statistics. We only require a finite fourth moment plus some very general conditions on the mean and variance of the population by virtue of the moderate deviations properties of $t$-statistics. A new consistent estimator for the proportion of alternative hypotheses is developed. Simulation studies support our theoretical results and demonstrate that the power of a multiple testing procedure can be substantially improved by using critical values directly, as opposed to the conventional $p$-value approach. Our method is applied in an analysis of the microarray data from a leukemia cancer study that involves testing a large number of hypotheses simultaneously.
Submitted 21 February, 2011; v1 submitted 10 February, 2011;
originally announced February 2011.
-
On asymptotically optimal tests under loss of identifiability in semiparametric models
Authors:
Rui Song,
Michael R. Kosorok,
Jason P. Fine
Abstract:
We consider tests of hypotheses when the parameters are not identifiable under the null in semiparametric models, where regularity conditions for profile likelihood theory fail. Exponential average tests based on integrated profile likelihood are constructed and shown to be asymptotically optimal under a weighted average power criterion with respect to a prior on the nonidentifiable aspect of the model. These results extend existing results for parametric models, which involve more restrictive assumptions on the form of the alternative than do our results. Moreover, the proposed tests accommodate models with infinite dimensional nuisance parameters which either may not be identifiable or may not be estimable at the usual parametric rate. Examples include tests of the presence of a change-point in the Cox model with current status data and tests of regression parameters in odds-rate models with right censored data. Optimal tests have not previously been studied for these scenarios. We study the asymptotic distribution of the proposed tests under the null, fixed contiguous alternatives and random contiguous alternatives. We also propose a weighted bootstrap procedure for computing the critical values of the test statistics. The optimal tests perform well in simulation studies, where they may exhibit improved power over alternative tests.
Submitted 24 August, 2009;
originally announced August 2009.
-
Bootstrapping the Grenander estimator
Authors:
Michael R. Kosorok
Abstract:
The goal of this paper is to study the bootstrap for the Grenander estimator. The first result is a proof of the inconsistency of the nonparametric bootstrap for the Grenander estimator at a given point. The second result is the development and verification of a bootstrap for the $L_1$ confidence band for the Grenander estimator. As part of this work, kernel estimators are studied as alternatives to the Grenander estimator. We show that when the second derivative of the true density is assumed to be uniformly bounded, there exist kernel estimators with faster convergence rates than the Grenander estimator. We study the implications of this in developing $L_1$ and uniform confidence bands and discuss some open questions.
Submitted 16 May, 2008;
originally announced May 2008.
-
The penalized profile sampler
Authors:
Guang Cheng,
Michael R. Kosorok
Abstract:
The penalized profile sampler for semiparametric inference is an extension of the profile sampler method (Lee, Kosorok and Fine, 2005) obtained by profiling a penalized log-likelihood. The idea is to base inference on the posterior distribution obtained by multiplying a profiled penalized log-likelihood by a prior for the parametric component, where the profiling and penalization are applied to the nuisance parameter. Because the prior is not applied to the full likelihood, the method is not strictly Bayesian. A benefit of this approximately Bayesian method is that it circumvents the need to put a prior on the possibly infinite-dimensional nuisance components of the model. We investigate the first and second order frequentist performance of the penalized profile sampler, and demonstrate that the accuracy of the procedure can be adjusted by the size of the assigned smoothing parameter. The theoretical validity of the procedure is illustrated for two examples: a partly linear model with normal error for current status data and a semiparametric logistic regression model. As far as we are aware, there are no other methods of inference in this context known to have second order frequentist validity.
Submitted 19 January, 2007;
originally announced January 2007.
-
General frequentist properties of the posterior profile distribution
Authors:
Guang Cheng,
Michael R. Kosorok
Abstract:
In this paper, inference for the parametric component of a semiparametric model based on sampling from the posterior profile distribution is thoroughly investigated from the frequentist viewpoint. The higher-order validity of the profile sampler obtained in Cheng and Kosorok [Ann. Statist. 36 (2008)] is extended to semiparametric models in which the infinite dimensional nuisance parameter may not have a root-$n$ convergence rate. This is a nontrivial extension because it requires a delicate analysis of the entropy of the semiparametric models involved. We find that the accuracy of inferences based on the profile sampler improves as the convergence rate of the nuisance parameter increases. Simulation studies are used to verify this theoretical result. We also establish that an exact frequentist confidence interval obtained by inverting the profile log-likelihood ratio can be estimated with higher-order accuracy by the credible set of the same type obtained from the posterior profile distribution. Our theory is verified for several specific examples.
Submitted 19 August, 2008; v1 submitted 7 December, 2006;
originally announced December 2006.
-
Higher order semiparametric frequentist inference with the profile sampler
Authors:
Guang Cheng,
Michael R. Kosorok
Abstract:
We consider higher order frequentist inference for the parametric component of a semiparametric model based on sampling from the posterior profile distribution. The first order validity of this procedure established by Lee, Kosorok and Fine in [J. American Statist. Assoc. 100 (2005) 960--969] is extended to second-order validity in the setting where the infinite-dimensional nuisance parameter achieves the parametric rate. Specifically, we obtain higher order estimates of the maximum profile likelihood estimator and of the efficient Fisher information. Moreover, we prove that an exact frequentist confidence interval for the parametric component at level $α$ can be estimated by the $α$-level credible set from the profile sampler with an error of order $O_P(n^{-1})$. Simulation studies are used to assess second-order asymptotic validity of the profile sampler. As far as we are aware, these are the first higher order accuracy results for semiparametric frequentist inference.
Submitted 19 August, 2008; v1 submitted 4 May, 2006;
originally announced May 2006.
-
Further details on inference under right censoring for transformation models with a change-point based on a covariate threshold
Authors:
Michael R. Kosorok,
Rui Song
Abstract:
We consider linear transformation models applied to right censored survival data with a change-point based on a covariate threshold. We establish consistency and weak convergence of the nonparametric maximum likelihood estimators. The change-point parameter is shown to be $n$-consistent, while the remaining parameters are shown to have the expected root-$n$ consistency. We show that the procedure is adaptive in the sense that the non-threshold parameters are estimable with the same precision as if the true threshold value were known. We also develop Monte-Carlo methods of inference for model parameters and score tests for the existence of a change-point. A key difficulty here is that some of the model parameters are not identifiable under the null hypothesis of no change-point. Simulation studies establish the validity of the proposed score tests for finite sample sizes.
Submitted 3 April, 2006;
originally announced April 2006.
-
Penalized log-likelihood estimation for partly linear transformation models with current status data
Authors:
Shuangge Ma,
Michael R. Kosorok
Abstract:
We consider partly linear transformation models applied to current status data. The unknown quantities are the transformation function, a linear regression parameter and a nonparametric regression effect. It is shown that the penalized MLE for the regression parameter is asymptotically normal and efficient and converges at the parametric rate, although the penalized MLEs for the transformation function and nonparametric regression effect are only $n^{1/3}$-consistent. Inference for the regression parameter based on a block jackknife is investigated. We also study computational issues and demonstrate the proposed methodology with a simulation study. The transformation models and partly linear regression terms, coupled with new estimation and inference techniques, provide flexible alternatives to the Cox model for current status data analysis.
Submitted 11 February, 2006;
originally announced February 2006.
-
Marginal asymptotics for the "large p, small n" paradigm: with applications to microarray data
Authors:
Michael R. Kosorok,
Shuangge Ma
Abstract:
The "large p, small n" paradigm arises in microarray studies, where expression levels of thousands of genes are monitored for a small number of subjects. There has been an increasing demand for study of asymptotics for the various statistical models and methodologies using genomic data. In this article, we focus on one-sample and two-sample microarray experiments, where the goal is to identify significantly differentially expressed genes. We establish uniform consistency of certain estimators of marginal distribution functions, sample means and sample medians under the "large p, small n" assumption. We also establish uniform consistency of marginal p-values based on certain asymptotic approximations which permit inference based on false discovery rate techniques. The effects of the normalization process on these results are also investigated. Simulation studies and data analyses are used to assess finite sample performance.
Submitted 12 August, 2005;
originally announced August 2005.
-
Robust Inference for Univariate Proportional Hazards Frailty Regression Models
Authors:
Michael R. Kosorok,
Bee Leng Lee,
Jason P. Fine
Abstract:
We consider a class of semiparametric regression models which are one-parameter extensions of the Cox [J. Roy. Statist. Soc. Ser. B 34 (1972) 187-220] model for right-censored univariate failure times. These models assume that the hazard given the covariates and a random frailty unique to each individual has the proportional hazards form multiplied by the frailty.
The frailty is assumed to have mean 1 within a known one-parameter family of distributions. Inference is based on a nonparametric likelihood. The behavior of the likelihood maximizer is studied under general conditions where the fitted model may be misspecified. The joint estimator of the regression and frailty parameters as well as the baseline hazard is shown to be uniformly consistent for the pseudo-value maximizing the asymptotic limit of the likelihood. Appropriately standardized, the estimator converges weakly to a Gaussian process. When the model is correctly specified, the procedure is semiparametric efficient, achieving the semiparametric information bound for all parameter components. It is also proved that the bootstrap gives valid inferences for all parameters, even under misspecification.
We demonstrate analytically the importance of the robust inference in several examples. In a randomized clinical trial, a valid test of the treatment effect is possible when other prognostic factors and the frailty distribution are both misspecified. Under certain conditions on the covariates, the ratios of the regression parameters are still identifiable. The practical utility of the procedure is illustrated on a non-Hodgkin's lymphoma dataset.
Submitted 5 October, 2004;
originally announced October 2004.