-
Walking Fingerprinting Using Wrist Accelerometry During Activities of Daily Living in NHANES
Authors:
Lily Koffman,
John Muschelli III,
Ciprian Crainiceanu
Abstract:
We propose a method for identifying individuals based on their continuously monitored wrist-worn accelerometry during activities of daily living. The method consists of three steps: (1) using Adaptive Empirical Pattern Transformation (ADEPT), a highly specific method to identify walking; (2) transforming the accelerometry time series into an image that corresponds to the joint distribution of the time series and its lags; and (3) using the resulting images to construct a person-specific walking fingerprint. The method is applied to 15,000 individuals from the National Health and Nutrition Examination Survey (NHANES) with up to 7 days of wrist accelerometry data collected at 80 Hertz. The resulting dataset contains more than 10 terabytes, is roughly 2 to 3 orders of magnitude larger than previous datasets used for activity recognition, is collected in the free living environment, and does not contain labels for walking periods. Using extensive cross-validation studies, we show that our method is highly predictive and can be successfully extended to a large, heterogeneous sample representative of the U.S. population: in the highest-performing model, the correct participant is in the top 1% of predictions 96% of the time.
Submitted 20 June, 2025;
originally announced June 2025.
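Step (2) of the abstract above, transforming the accelerometry time series into an image of the joint distribution of the series and its lags, can be illustrated with a two-dimensional histogram. This is a minimal sketch and not the authors' implementation: the function name `lag_image`, the bin count, the value range, and the 12-sample lag are hypothetical choices; only the 80 Hz sampling rate comes from the abstract.

```python
import numpy as np

SAMPLING_HZ = 80  # sampling rate stated in the abstract


def lag_image(signal, lag, bins=32, value_range=(0.0, 3.0)):
    """Return a bins x bins image estimating the joint distribution of
    (x[t], x[t + lag]). All parameter defaults are illustrative assumptions."""
    x = np.asarray(signal, dtype=float)
    if lag <= 0 or lag >= x.size:
        raise ValueError("lag must be between 1 and len(signal) - 1")
    current, lagged = x[:-lag], x[lag:]
    hist, _, _ = np.histogram2d(
        current, lagged, bins=bins, range=[value_range, value_range]
    )
    total = hist.sum()
    # Normalise counts to proportions so images from walking bouts of
    # different lengths are comparable.
    return hist / total if total > 0 else hist


if __name__ == "__main__":
    # Synthetic stand-in for ten seconds of acceleration magnitude.
    t = np.arange(0, 10, 1 / SAMPLING_HZ)
    synthetic = 1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t)
    image = lag_image(synthetic, lag=12)  # 12 samples = 0.15 s at 80 Hz
    print(image.shape)
```

Images computed at several lags could then be stacked as predictors for a per-person classifier; how the authors combine them is not specified in the abstract.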
-
Comparing Step Counting Algorithms for High-Resolution Wrist Accelerometry Data in NHANES 2011-2014
Authors:
Lily Koffman,
Ciprian Crainiceanu,
John Muschelli III
Abstract:
Purpose: To quantify the relative performance of step counting algorithms in studies that collect free-living high-resolution wrist accelerometry data and to highlight the implications of using these algorithms in translational research. Methods: Five step counting algorithms (four open source and one proprietary) were applied to the publicly available, free-living, high-resolution wrist accelerometry data collected by the National Health and Nutrition Examination Survey (NHANES) in 2011-2014. The mean daily total step counts were compared in terms of correlation, predictive performance, and estimated hazard ratios of mortality. Results: The estimated step counts were highly correlated (median=0.91, range 0.77 to 0.98) and had high and comparable predictive performance for mortality (median concordance=0.72, range 0.70 to 0.73). The distributions of the number of steps in the population varied widely (mean step counts ranged from 2,453 to 12,169). Hazard ratios of mortality associated with a 500-step increase per day varied among step counting algorithms between HR=0.88 and 0.96, corresponding to a threefold difference in mortality risk reduction ([1-0.88]/[1-0.96]=3). Conclusion: Different step counting algorithms provide correlated step estimates and have similar predictive performance that is better than traditional predictors of mortality. However, they provide widely different distributions of step counts and estimated reductions in mortality risk for a 500-step increase.
Submitted 14 October, 2024;
originally announced October 2024.
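The risk-reduction comparison in the abstract above, [1-0.88]/[1-0.96], can be checked with a few lines of arithmetic. This is a sketch for illustration only; the function names are hypothetical, and the two hazard ratios are the extremes quoted in the abstract.

```python
def risk_reduction(hazard_ratio):
    """Relative reduction in hazard implied by a hazard ratio below 1."""
    return 1.0 - hazard_ratio


def reduction_ratio(hr_a, hr_b):
    """How many times larger the risk reduction under hr_a is than under hr_b."""
    denominator = risk_reduction(hr_b)
    if denominator == 0:
        raise ValueError("hr_b equals 1, so there is no risk reduction to compare against")
    return risk_reduction(hr_a) / denominator


if __name__ == "__main__":
    # Extremes reported across algorithms for a 500-step increase per day.
    print(round(risk_reduction(0.88), 2))         # 12% reduction
    print(round(risk_reduction(0.96), 2))         # 4% reduction
    print(round(reduction_ratio(0.88, 0.96), 2))  # ratio of the two reductions
```

The same 500-step increase is therefore associated with a 12% or a 4% lower hazard depending on which algorithm produced the step counts.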
-
Open Case Studies: Statistics and Data Science Education through Real-World Applications
Authors:
Carrie Wright,
Qier Meng,
Michael R. Breshock,
Lyla Atta,
Margaret A. Taub,
Leah R Jager,
John Muschelli,
Stephanie C. Hicks
Abstract:
With unprecedented and growing interest in data science education, there are limited educator materials that provide meaningful opportunities for learners to practice statistical thinking, as defined by Wild and Pfannkuch (1999), with messy data addressing real-world challenges. As a solution, Nolan and Speed (1999) advocated for bringing applications to the forefront in the undergraduate statistics curriculum with the use of in-depth case studies to encourage and develop statistical thinking in the classroom. Limitations to this approach include the significant time investment required to develop a case study -- namely, to select a motivating question and to create an illustrative data analysis -- and the domain expertise needed. As a result, case studies based on realistic challenges, not toy examples, are scarce. To address this, we developed the Open Case Studies (https://www.opencasestudies.org) project, which offers a new statistical and data science education case study model. This educational resource provides self-contained, multimodal, peer-reviewed, and open-source guides (or case studies) from real-world examples for active experiences of complete data analyses. We developed an educator's guide describing how to most effectively use the case studies, how to modify and adapt components of the case studies in the classroom, and how to contribute new case studies (https://www.opencasestudies.org/OCS_Guide).
Submitted 12 January, 2023;
originally announced January 2023.
-
ciftiTools: A package for reading, writing, visualizing and manipulating CIFTI files in R
Authors:
Damon Pham,
John Muschelli,
Amanda Mejia
Abstract:
There is significant interest in adopting surface- and grayordinate-based analysis of MR data for a number of reasons, including improved whole-cortex visualization, the ability to perform surface smoothing to avoid issues associated with volumetric smoothing, improved inter-subject alignment, and reduced dimensionality. The CIFTI grayordinate file format introduced by the Human Connectome Project further advances grayordinate-based analysis by combining gray matter data from the left and right cortical hemispheres with gray matter data from the subcortex and cerebellum into a single file. Analyses performed in grayordinate space are well-suited to leverage information shared across the brain and across subjects through both traditional analysis techniques and more advanced statistical methods, including Bayesian methods. The R statistical environment facilitates use of advanced statistical techniques, yet little support for grayordinates analysis has been previously available in R. Indeed, few comprehensive programmatic tools for working with CIFTI files have been available in any language. Here, we present the ciftiTools R package, which provides a unified environment for reading, writing, visualizing, and manipulating CIFTI files and related data formats. We illustrate ciftiTools' convenient and user-friendly suite of tools for working with grayordinates and surface geometry data in R, and we describe how ciftiTools is being utilized to advance the statistical analysis of grayordinate-based functional MRI data.
Submitted 18 January, 2022; v1 submitted 21 June, 2021;
originally announced June 2021.
-
ROC and AUC with a Binary Predictor: a Potentially Misleading Metric
Authors:
John Muschelli
Abstract:
In the analysis of binary outcomes, the receiver operating characteristic (ROC) curve is heavily used to show the performance of a model or algorithm. The ROC curve is informative about the performance over a series of thresholds and can be summarized by the area under the curve (AUC), a single number. When a predictor is categorical, the ROC curve has one fewer potential thresholds than the number of categories; when the predictor is binary, there is only one threshold. As the AUC may be used in decision-making processes for determining the best model, it is important to discuss how it agrees with the intuition from the ROC curve. We discuss how the interpolation of the curve between thresholds with binary predictors can largely change the AUC. Overall, we show that linearly interpolating the ROC curve with binary predictors, as is most commonly done in software, corresponds to the estimated AUC, and we believe this can lead to misleading results. We compare R, Python, Stata, and SAS software implementations. We recommend reporting the interpolation used and discuss the merit of using the step function interpolator, also referred to as the "pessimistic" approach by Fawcett (2006).
Submitted 5 May, 2020; v1 submitted 12 March, 2019;
originally announced March 2019.
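The effect of the interpolation choice described in the abstract above can be seen with a small worked example. A binary predictor yields a single ROC point (FPR, TPR) between (0, 0) and (1, 1); joining the points with straight lines gives an area of (TPR + 1 - FPR) / 2, while a step function that stays at the lower value until the next threshold gives TPR × (1 - FPR). This is a sketch under those geometric assumptions, with hypothetical function names and made-up data, not the paper's code.

```python
import numpy as np


def binary_roc_point(y_true, x_binary):
    """Return (FPR, TPR) for a binary predictor against a binary outcome."""
    y = np.asarray(y_true).astype(bool)
    x = np.asarray(x_binary).astype(bool)
    if y.all() or not y.any():
        raise ValueError("y_true must contain both outcome classes")
    tpr = (x & y).sum() / y.sum()        # sensitivity
    fpr = (x & ~y).sum() / (~y).sum()    # 1 - specificity
    return float(fpr), float(tpr)


def auc_linear(fpr, tpr):
    """Area under straight lines through (0,0), (fpr,tpr), (1,1)."""
    return (tpr + 1.0 - fpr) / 2.0


def auc_step(fpr, tpr):
    """Area under the 'pessimistic' step function: 0 until fpr, then tpr."""
    return tpr * (1.0 - fpr)


if __name__ == "__main__":
    # Made-up data: 4 cases and 4 controls.
    outcome = [1, 1, 1, 1, 0, 0, 0, 0]
    predictor = [1, 1, 1, 0, 1, 0, 0, 0]
    fpr, tpr = binary_roc_point(outcome, predictor)
    print(fpr, tpr, auc_linear(fpr, tpr), auc_step(fpr, tpr))
```

With sensitivity and specificity both at 0.75, the linear rule reports the average of the two while the step rule reports their product, which is why stating the interpolation used matters when AUCs are compared.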
-
Relating multi-sequence longitudinal intensity profiles and clinical covariates in new multiple sclerosis lesions
Authors:
Elizabeth M. Sweeney,
Russell T. Shinohara,
Blake E. Dewey,
Matthew K. Schindler,
John Muschelli,
Daniel S. Reich,
Ciprian M. Crainiceanu,
Ani Eloyan
Abstract:
Structural magnetic resonance imaging (MRI) can be used to detect lesions in the brains of multiple sclerosis (MS) patients. The formation of these lesions is a complex process involving inflammation, tissue damage, and tissue repair, all of which are visible on MRI. Here we characterize the lesion formation process on longitudinal, multi-sequence structural MRI from 34 MS patients and relate the longitudinal changes we observe within lesions to therapeutic interventions. In this article, we first outline a pipeline to extract voxel level, multi-sequence longitudinal profiles from four MRI sequences within lesion tissue. We then propose two models to relate clinical covariates to the longitudinal profiles. The first model is a principal component analysis (PCA) regression model, which collapses the information from all four profiles into a scalar value. We find that the score on the first PC identifies areas of slow, long-term intensity changes within the lesion at a voxel level, as validated by two experienced clinicians, a neuroradiologist and a neurologist. On a quality scale of 1 to 4 (4 being the highest) the neuroradiologist gave the score on the first PC a median rating of 4 (95% CI: [4,4]), and the neurologist gave it a median rating of 3 (95% CI: [3,3]). In the PCA regression model, we find that treatment with disease modifying therapies (p-value < 0.01), steroids (p-value < 0.01), and being closer to the boundary of abnormal signal intensity (p-value < 0.01) are associated with a return of a voxel to intensity values closer to that of normal-appearing tissue. The second model is a function-on-scalar regression, which allows for assessment of the individual time points at which the covariates are associated with the profiles. In the function-on-scalar regression both age and distance to the boundary were found to have a statistically significant association with the profiles.
Submitted 28 September, 2015;
originally announced September 2015.