Search | arXiv e-print repository

Variable Selection in Functional Linear Cox Model

Authors: Yuanzhen Yue, Stella Self, Yichao Wu, Jiajia Zhang, Rahul Ghosal

Abstract: Modern biomedical studies frequently collect complex, high-dimensional physiological signals using wearables and sensors along with time-to-event outcomes, making efficient variable selection methods crucial for interpretation and improving the accuracy of survival models. We propose a novel variable selection method for a functional linear Cox model with multiple functional and scalar covariates measured at baseline. We utilize a spline-based semiparametric estimation approach for the functional coefficients and a group minimax concave type penalty (MCP), which effectively integrates smoothness and sparsity into the estimation of functional coefficients. An efficient group descent algorithm is used for optimization, and an automated procedure is provided to select optimal values of the smoothing and sparsity parameters. Through simulation studies, we demonstrate the method's ability to perform accurate variable selection and estimation. The method is applied to the 2003-06 cohort of the National Health and Nutrition Examination Survey (NHANES) data, identifying the key temporally varying distributional patterns of physical activity and demographic predictors related to all-cause mortality. Our analysis sheds light on the intricate association between daily distributional patterns of physical activity and all-cause mortality among older US adults.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2204.10882 [pdf]

Cluster Detection Capabilities of the Average Nearest Neighbor Ratio and Ripley's K Function on Areal Data: an Empirical Assessment

Authors: Nadeesha Vidanapathirana, Yuan Wang, Alexander C. McLain, Stella Self

Abstract: Spatial clustering detection methods are widely used in many fields including epidemiology, ecology, biology, physics, and sociology. In these fields, areal data is often of interest; such data may result from spatial aggregation (e.g., the number of disease cases in a county) or may be inherent attributes of the areal unit as a whole (e.g., the habitat suitability of a conserved land parcel). This study aims to assess the performance of two spatial clustering detection methods on areal data: the average nearest neighbor (ANN) ratio and Ripley's K function. These methods are designed for point process data, but their ease of implementation in GIS software (e.g., in ESRI ArcGIS) and the lack of analogous methods for areal data have contributed to their use for areal data. Despite the popularity of applying these methods to areal data, little research has explored their properties in the areal data context. In this paper we conduct a simulation study to evaluate the performance of each method for areal data under various areal structures and types of spatial dependence. These studies find that the traditional approach to hypothesis testing using the ANN ratio or Ripley's K function results in inflated empirical type I error rates when applied to areal data. We demonstrate that this issue can be remedied for both approaches by using Monte Carlo methods which acknowledge the areal nature of the data to estimate the distribution of the test statistic under the null hypothesis. While such an approach is not currently implemented in ArcGIS, it can be easily done in R using code provided by the authors.

Submitted 13 July, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

arXiv:2204.10852 [pdf, other]

A Generalization of Ripley's K Function for the Detection of Spatial Clustering in Areal Data

Authors: Stella Self, Anna Overby, Anja Zgodic, David White, Alexander McLain, Caitlin Dyckman

Abstract: Spatial clustering detection has a variety of applications in diverse fields, including identifying infectious disease outbreaks, assessing land use patterns, pinpointing crime hotspots, and identifying clusters of neurons in brain imaging applications. While performing spatial clustering analysis on point process data is common, applications to areal data are frequently of interest. For example, researchers might wish to know if census tracts with a case of a rare medical condition or an outbreak of an infectious disease tend to cluster together spatially. Since few spatial clustering methods are designed for areal data, researchers often reduce the areal data to point process data (e.g., using the centroid of each areal unit) and apply methods designed for point process data, such as Ripley's K function or the average nearest neighbor method. However, since these methods were not designed for areal data, a number of issues can arise. For example, we show that they can result in loss of power and/or a significantly inflated type I error rate. To address these issues, we propose a generalization of Ripley's K function designed specifically to detect spatial clustering in areal data. We compare its performance to that of the traditional Ripley's K function, the average nearest neighbor method, and the spatial scan statistic with an extensive simulation study. We then evaluate the real world performance of the method by using it to detect spatial clustering in land parcels containing conservation easements and US counties with high pediatric overweight/obesity rates.

Submitted 22 April, 2022; originally announced April 2022.

arXiv:1803.11194 [pdf, ps, other]

doi 10.1002/env.2538

A Large Scale Spatio-temporal Binomial Regression Model for Estimating Seroprevalence Trends

Authors: Stella Watson Self, Christopher McMahan, D. Andrew Brown, Robert Lund, Jenna Gettings, Michael Yabsley

Abstract: This paper develops a large-scale Bayesian spatio-temporal binomial regression model for the purpose of investigating regional trends in antibody prevalence to Borrelia burgdorferi, the causative agent of Lyme disease. The proposed model uses Gaussian predictive processes to estimate the spatially varying trends and a conditional autoregressive model to account for spatio-temporal dependence. Careful consideration is made to develop a novel framework that is scalable to large spatio-temporal data. The proposed model is used to analyze approximately 16 million Borrelia burgdorferi test results collected on dogs located throughout the conterminous United States over a sixty month period. This analysis identifies several regions of increasing canine risk. Specifically, this analysis reveals evidence that Lyme disease is getting worse in some endemic regions and that it could potentially be spreading to other non-endemic areas. Further, given the zoonotic nature of this vector-borne disease, this analysis could potentially reveal areas of increasing human risk.

Submitted 29 March, 2018; originally announced March 2018.

Comments: 19 pages without figures. All figures are available as ancillary files

arXiv:1702.05518 [pdf, other]

doi 10.1080/00031305.2019.1595144

Sampling strategies for fast updating of Gaussian Markov random fields

Authors: D. Andrew Brown, Christopher S. McMahan, Stella Watson Self

Abstract: Gaussian Markov random fields (GMRFs) are popular for modeling dependence in large areal datasets due to their ease of interpretation and computational convenience afforded by the sparse precision matrices needed for random variable generation. Typically in Bayesian computation, GMRFs are updated jointly in a block Gibbs sampler or componentwise in a single-site sampler via the full conditional distributions. The former approach can speed convergence by updating correlated variables all at once, while the latter avoids solving large matrices. We consider a sampling approach in which the underlying graph can be cut so that conditionally independent sites are updated simultaneously. This algorithm allows a practitioner to parallelize updates of subsets of locations or to take advantage of 'vectorized' calculations in a high-level language such as R. Through both simulated and real data, we demonstrate computational savings that can be achieved versus both single-site and block updating, regardless of whether the data are on a regular or an irregular lattice. The approach provides a good compromise between statistical and computational efficiency and is accessible to statisticians without expertise in numerical analysis or advanced computing.

Submitted 4 February, 2019; v1 submitted 17 February, 2017; originally announced February 2017.

Comments: Revised introduction and expanded numerical examples to include Rcpp and parallel implementation. Supplementary material available from the authors. 38 pages, 8 figures

Journal ref: The American Statistician, 2019

Showing 1–5 of 5 results for author: Self, S