-
Continuous Temporal Learning of Probability Distributions via Neural ODEs with Applications in Continuous Glucose Monitoring Data
Authors:
Antonio Álvarez-López,
Marcos Matabuena
Abstract:
Modeling the continuous-time dynamics of probability distributions from time-dependent data samples is a fundamental problem in many fields, including digital health. The aim is to analyze how the distribution of a biomarker, such as glucose, evolves over time and how these changes may reflect the progression of chronic diseases such as diabetes. In this paper, we propose a novel probabilistic model based on a mixture of Gaussian distributions to capture how samples from a continuous-time stochastic process evolve over time. To model potential distribution shifts over time, we introduce a time-dependent function parameterized by a Neural Ordinary Differential Equation (Neural ODE) and estimate it non-parametrically using the Maximum Mean Discrepancy (MMD). The proposed model is highly interpretable, detects subtle temporal shifts, and remains computationally efficient. Through simulation studies, we show that it performs competitively in terms of estimation accuracy against state-of-the-art, less interpretable methods such as normalized gradient flows and non-parametric kernel density estimators. Finally, we demonstrate the utility of our method on digital clinical-trial data, showing how the intervention alters the time-dependent distribution of glucose levels and enabling a rigorous comparison of control and treatment groups from novel mathematical and clinical perspectives.
Submitted 13 May, 2025;
originally announced May 2025.
-
Variable Selection for Fixed and Random Effects in Multilevel Functional Mixed Effects Models
Authors:
Rahul Ghosal,
Marcos Matabuena,
Enakshi Saha
Abstract:
We develop a new method for simultaneously selecting fixed and random effects in a multilevel functional regression model. The proposed method is motivated by accelerometer-derived physical activity data from the 2011-12 cohort of the National Health and Nutrition Examination Survey (NHANES), where we are interested in identifying age and race-specific heterogeneity in covariate effects on the diurnal patterns of physical activity across the lifespan. Existing methods for variable selection in function-on-scalar regression have primarily been designed for fixed effect selection and for single-level functional data. In high-dimensional multilevel functional regression, the presence of cluster-specific heterogeneity in covariate effects could be detected through sparsity in fixed and random effects, and for this purpose, we propose a multilevel functional mixed effects selection (MuFuMES) method. The fixed and random functional effects are modelled using splines, with spike-and-slab group lasso (SSGL) priors on the unknown parameters of interest, and a computationally efficient MAP estimation approach is employed for mixed effect selection through an Expectation Conditional Maximization (ECM) algorithm. A simulation study illustrates the satisfactory selection accuracy of the method, with negligible false-positive and false-negative rates. The proposed method is applied to the accelerometer data from the NHANES 2011-12 cohort, where it effectively identifies age and race-specific heterogeneity in covariate effects on the diurnal patterns of physical activity, recovering biologically meaningful insights.
Submitted 8 May, 2025;
originally announced May 2025.
-
Variable Selection Methods for Multivariate, Functional, and Complex Biomedical Data in the AI Age
Authors:
Marcos Matabuena
Abstract:
Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metric spaces based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or nonparametric additive models, and to a broad range of random responses, such as univariate, multivariate Euclidean data, functional data, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed, achieving several orders of magnitude improvement over competitors across various types of statistical responses, such as mathematical functions. While our framework is general and is not designed for a specific regression or scientific problem, the article is self-contained and focuses on biomedical applications. In clinical areas, it serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in the variable selection problem in this new technological AI era.
Submitted 12 January, 2025;
originally announced January 2025.
-
Denoising Data with Measurement Error Using a Reproducing Kernel-based Diffusion Model
Authors:
Mingyang Yi,
Marcos Matabuena,
Ruoyu Wang
Abstract:
The ongoing technological revolution in measurement systems enables the acquisition of high-resolution samples in fields such as engineering, biology, and medicine. However, these observations are often subject to errors from measurement devices. Motivated by this challenge, we propose a denoising framework that employs diffusion models to generate denoised data whose distribution closely approximates the unobservable, error-free data, thereby permitting standard data analysis based on the denoised data. The key element of our framework is a novel Reproducing Kernel Hilbert Space-based method that trains the diffusion model with only error-contaminated data, admits a closed-form solution, and achieves a fast convergence rate in terms of estimation error. Furthermore, we verify the effectiveness of our method by deriving an upper bound on the Kullback-Leibler divergence between the distributions of the generated denoised data and the error-free data. A series of simulations also verifies the promising empirical performance of the proposed method compared to other state-of-the-art methods. To further illustrate the potential of this denoising framework in a real-world application, we apply it in a digital health context, showing how measurement error in continuous glucose monitors can influence conclusions drawn from a clinical trial on diabetes mellitus.
Submitted 30 December, 2024;
originally announced January 2025.
-
Glucodensity Functional Profiles Outperform Traditional Continuous Glucose Monitoring Metrics
Authors:
Marcos Matabuena,
Rahul Ghosal,
Javier Enrique Aguilar,
Robert Wagner,
Carmen Fernández Merino,
Juan Sánchez Castro,
Vadim Zipunnikov,
Jukka-Pekka Onnela,
Francisco Gude
Abstract:
Continuous glucose monitoring (CGM) data has revolutionized the management of type 1 diabetes, particularly when integrated with insulin pumps to mitigate clinical events such as hypoglycemia. Recently, there has been growing interest in utilizing CGM devices in clinical studies involving healthy and diabetes populations. However, efficiently exploiting the high temporal resolution of CGM profiles remains a significant challenge. Numerous indices -- such as time-in-range metrics and glucose variability measures -- have been proposed, but evidence suggests these metrics overlook critical aspects of glucose dynamic homeostasis. As an alternative method, this paper explores the clinical value of glucodensity metrics in capturing glucose dynamics -- specifically the speed and acceleration of CGM time series -- as new biomarkers for predicting long-term glucose outcomes. Our results demonstrate significant information gains, exceeding 20% in terms of adjusted $R^2$, in forecasting glycosylated hemoglobin (HbA1c) and fasting plasma glucose (FPG) at five and eight years from baseline AEGIS data, compared to traditional non-CGM and CGM glucose biomarkers. These findings underscore the importance of incorporating more complex CGM functional metrics, such as the glucodensity approach, to fully capture continuous glucose fluctuations across different time-scale resolutions.
Submitted 1 October, 2024;
originally announced October 2024.
-
Conformal Prediction in Dynamic Biological Systems
Authors:
Alberto Portela,
Julio R. Banga,
Marcos Matabuena
Abstract:
Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited, and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems. The software for the methodology and the reproduction of the results is available at https://zenodo.org/doi/10.5281/zenodo.13644870.
Submitted 28 October, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
Uncertainty quantification for intervals
Authors:
Carlos García Meixide,
Michael R. Kosorok,
Marcos Matabuena
Abstract:
Data following an interval structure are increasingly prevalent in many scientific applications. In medicine, clinical events are often monitored between two clinical visits, making the exact time of the event unknown and generating outcomes with a range format. As interest in automating healthcare decisions grows, uncertainty quantification via predictive regions becomes essential for developing reliable and trustworthy predictive algorithms. However, the statistical literature currently lacks a general methodology for interval targets, especially when these outcomes are incomplete due to censoring. We propose an uncertainty quantification algorithm for interval responses and establish its theoretical properties using empirical process arguments based on a newly developed class of functions specifically designed for these interval data structures. Although this paper primarily focuses on deriving predictive regions for interval-censored data, the approach can also be applied to other statistical modeling tasks, such as goodness-of-fit assessments. Finally, the applicability of the method is demonstrated through simulations, showing up to a 60% improvement in conditional coverage. Our new algorithm is also applied to various biomedical contexts, including two clinical examples: i) sleep duration and its association with cardiovascular diseases, and ii) survival time in relation to physical activity levels.
Submitted 30 March, 2025; v1 submitted 29 August, 2024;
originally announced August 2024.
-
Functional Time Transformation Model with Applications to Digital Health
Authors:
Rahul Ghosal,
Marcos Matabuena,
Sujit K. Ghosh
Abstract:
The advent of wearable and sensor technologies now leads to functional predictors which are intrinsically infinite dimensional. While the existing approaches for functional data and survival outcomes lean on the well-established Cox model, the proportional hazards (PH) assumption might not always be suitable in real-world applications. Motivated by physiological signals encountered in digital medicine, we develop a more general and flexible functional time-transformation model for estimating the conditional survival function with both functional and scalar covariates. A partially functional regression model is used to directly model the survival time on the covariates through an unknown monotone transformation and a known error distribution. We use Bernstein polynomials to model the monotone transformation function and the smooth functional coefficients. A sieve method of maximum likelihood is employed for estimation. Numerical simulations illustrate a satisfactory performance of the proposed method in estimation and inference. We demonstrate the application of the proposed model through two case studies involving wearable data: (i) understanding the association between diurnal physical activity patterns and all-cause mortality based on accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014, and (ii) modelling time-to-hypoglycemia events in a cohort of diabetic patients based on distributional representations of continuous glucose monitoring (CGM) data. The results provide important epidemiological insights into the direct association between survival times and the physiological signals and also exhibit superior predictive performance compared to traditional summary-based biomarkers in the CGM study.
Submitted 28 June, 2024;
originally announced June 2024.
-
Multilevel functional data analysis modeling of human glucose response to meal intake
Authors:
Marcos Matabuena,
Joe Sartini,
Francisco Gude
Abstract:
Glucose meal response information collected via Continuous Glucose Monitoring (CGM) is relevant to the assessment of individual metabolic status and the support of personalized diet prescriptions. However, the complexity of the data produced by CGM monitors pushes the limits of existing analytic methods. CGM data often exhibits substantial within-person variability and has a natural multilevel structure. This research is motivated by the analysis of CGM data from individuals without diabetes in the AEGIS study. The dataset includes detailed information on meal timing and nutrition for each individual over different days. The primary focus of this study is to examine CGM glucose responses following patients' meals and explore the time-dependent associations with dietary and patient characteristics. Motivated by this problem, we propose a new analytical framework based on multilevel functional models, including a new functional mixed R-square coefficient. The use of these models illustrates three key points: (i) the importance of analyzing glucose responses across the entire functional domain when making diet recommendations; (ii) the differential metabolic responses between normoglycemic and prediabetic patients, particularly with regard to lipid intake; (iii) the importance of including random, person-level effects when modelling this scientific problem.
Submitted 23 May, 2024;
originally announced May 2024.
-
Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces
Authors:
Marcos Matabuena,
Rahul Ghosal,
Pavlo Mozharovskyi,
Oscar Hernan Madrid Padilla,
Jukka-Pekka Onnela
Abstract:
Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees for the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different functional objects, such as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics.
Submitted 22 May, 2024;
originally announced May 2024.
-
Uncertainty quantification in metric spaces
Authors:
Gábor Lugosi,
Marcos Matabuena
Abstract:
This paper introduces a novel uncertainty quantification framework for regression models where the response takes values in a separable metric space, and the predictors are in a Euclidean space. The proposed algorithms can efficiently handle large datasets and are agnostic to the predictive base model used. Furthermore, the algorithms possess asymptotic consistency guarantees and, in some special homoscedastic cases, we provide non-asymptotic guarantees. To illustrate the effectiveness of the proposed uncertainty quantification framework, we use a linear regression model for metric responses (known as the global Fréchet model) in various clinical applications related to precision and digital medicine. The different clinical outcomes analyzed are represented as complex statistical objects, including multivariate Euclidean data, Laplacian graphs, and probability distributions.
Submitted 8 May, 2024;
originally announced May 2024.
-
Optimal Cut-Point Estimation for Functional Digital Biomarkers: Application to Diabetes Risk Stratification via Continuous Glucose Monitoring
Authors:
Oscar Lado-Baleato,
Carla Díaz-Louzao,
Francisco Gude,
Marcos Matabuena
Abstract:
Establishing optimal cut-offs for clinical biomarkers is a fundamental statistical problem in epidemiology, clinical trials, and drug discovery. While there is extensive literature regarding the definition of optimal cut-offs for scalar biomarkers, methodologies for analyzing random statistical objects in the more complex spaces associated with random functions and graphs, which are increasingly required in modern digital health applications, are lacking. This paper proposes a new, general, simple methodology for defining optimal cut-offs for random objects residing in separable Hilbert spaces. Its underlying motivation is the need to create new digital health rules for the detection of diabetes mellitus, and thus better exploit the continuous high-dimensional functional information provided by continuous glucose monitors (CGM). A functional cut-off for identifying diabetes is offered, based on glucose distributional representations from CGM time series. This work may be a valuable resource for researchers interested in defining and validating new digital biomarkers for biosensor time series.
Submitted 9 March, 2025; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Screening for Diabetes Mellitus in the U.S. Population Using Neural Network Models and Complex Survey Designs
Authors:
Marcos Matabuena,
Juan C. Vidal,
Rahul Ghosal,
Jukka-Pekka Onnela
Abstract:
Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential for minimizing selection biases in the statistical results. The objectives of this paper are to: (i) propose a general predictive framework for regression and classification using neural network (NN) modeling that incorporates survey weights into the estimation process; (ii) introduce an uncertainty quantification algorithm for model prediction tailored to data from complex survey designs; and (iii) apply this method to develop robust risk score models for assessing the risk of diabetes mellitus in the US population, utilizing data from the NHANES 2011-2014 cohort. The results indicate that models of varying complexity, each utilizing a different set of variables, demonstrate different discriminative power for predicting diabetes (at different economic costs), yet yield generalizable results at the population level. Although the focus is on diabetes, this NN predictive framework is adaptable for developing clinical models across a diverse range of diseases and medical cohorts. The software and data used in this paper are publicly available on GitHub.
Submitted 25 March, 2025; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information
Authors:
Marcos Matabuena,
Carla Díaz-Louzao,
Rahul Ghosal,
Francisco Gude
Abstract:
The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric-space-based statistical objects. The aim of this paper is to introduce a novel two-step framework that comprises: (i) an imputation method for statistical objects taking values in metric spaces, and (ii) a criterion for personalizing imputation using conformal inference techniques. This work is motivated by the need to impute distributional functional representations of continuous glucose monitoring (CGM) data within the context of a longitudinal study on diabetes, where a significant fraction of patients do not have available CGM profiles. The importance of these methods is illustrated by evaluating the effectiveness of CGM data as new digital biomarkers to predict the time to diabetes onset in healthy populations. To address these scientific challenges, we propose: (i) a new regression algorithm for missing responses; (ii) novel conformal prediction algorithms tailored for metric spaces with a focus on density responses within the 2-Wasserstein geometry; (iii) a broadly applicable personalized imputation criterion, designed to enhance both of the aforementioned strategies, yet valid across any statistical model and data structure. Our findings reveal that incorporating CGM data into diabetes time-to-event analysis, augmented with a novel personalization phase of imputation, significantly enhances predictive accuracy by over ten percent compared to traditional predictive models for time to diabetes.
Submitted 26 March, 2024;
originally announced March 2024.
-
Multilevel functional distributional models with application to continuous glucose monitoring in diabetes clinical trials
Authors:
Marcos Matabuena,
Ciprian M. Crainiceanu
Abstract:
Continuous glucose monitoring (CGM) is a minimally invasive technology that allows continuous monitoring of an individual's blood glucose. We focus on a large clinical trial that collected CGM data every few minutes for 26 weeks and assume that the basic observation unit is the distribution of CGM observations in a four-week interval. The resulting data structure is multilevel (because each individual has multiple months of data) and distributional (because the data for each four-week interval is represented as a distribution). The scientific goals are to: (1) identify and quantify the effects of factors that affect glycemic control in type 1 diabetes (T1D) patients; and (2) identify and characterize the patients who respond to treatment. To address these goals, we propose a new multilevel functional model that treats the CGM distributions as a response. Methods are motivated by and applied to data collected by The Juvenile Diabetes Research Foundation Continuous Glucose Monitoring Group. Reproducible code for the methods introduced here is available on GitHub.
Submitted 15 March, 2024;
originally announced March 2024.
-
kNN Algorithm for Conditional Mean and Variance Estimation with Automated Uncertainty Quantification and Variable Selection
Authors:
Marcos Matabuena,
Juan C. Vidal,
Oscar Hernan Madrid Padilla,
Jukka-Pekka Onnela
Abstract:
In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach incorporates a robust uncertainty quantification mechanism, leveraging our prior estimation work on conditional mean and variance. The employment of kNN ensures scalable computational efficiency in predicting intervals and statistical accuracy in line with optimal non-parametric rates. Additionally, we introduce a new kNN semi-parametric algorithm for estimating ROC curves, accounting for covariates. For selecting the smoothing parameter k, we propose an algorithm with theoretical guarantees. Incorporating variable selection enhances the performance of the method significantly over conventional kNN techniques in various modeling tasks. We validate the approach through simulations in low, moderate, and high-dimensional covariate spaces. The algorithm's effectiveness is particularly notable in biomedical applications as demonstrated in two case studies. Concluding with a theoretical analysis, we highlight the consistency and convergence rate of our method over traditional kNN models, particularly when the underlying regression model takes values in a low-dimensional space.
Submitted 2 February, 2024;
originally announced February 2024.
-
Multivariate Scalar on Multidimensional Distribution Regression
Authors:
Rahul Ghosal,
Marcos Matabuena
Abstract:
We develop a new method for multivariate scalar on multidimensional distribution regression. Traditional approaches typically analyze isolated univariate scalar outcomes or consider unidimensional distributional representations as predictors. However, these approaches are sub-optimal because i) they fail to utilize the dependence between the distributional predictors and ii) they neglect the correlation structure of the response. To overcome these limitations, we propose a multivariate distributional analysis framework that harnesses the power of multivariate density functions and multitask learning. We develop a computationally efficient semiparametric estimation method for modelling the effect of the latent joint density on the multivariate response of interest. Additionally, we introduce a new conformal algorithm for quantifying the uncertainty of regression models with multivariate responses and distributional predictors, providing valuable insights into the conditional distribution of the response. We validate the effectiveness of the proposed method through comprehensive numerical simulations, which demonstrate its superior performance compared to traditional methods. The application of the proposed method is demonstrated on tri-axial accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014 for modelling the association between cognitive scores across various domains and the distributional representation of physical activity among the older adult population. Our results highlight the advantages of the proposed approach, emphasizing the significance of incorporating the complete spatial information derived from the accelerometer device.
Submitted 16 October, 2023;
originally announced October 2023.
-
Predicting Distributions of Physical Activity Profiles in the NHANES Database Using a Partially Linear Fréchet Single Index Model
Authors:
Marcos Matabuena,
Aritra Ghosal,
Wendy Meiring,
Alexander Petersen
Abstract:
Object-oriented data analysis is a fascinating and evolving field in modern statistical science, with the potential to make significant contributions to biomedical applications. This statistical framework facilitates the development of new methods to analyze complex data objects that capture more information than traditional clinical biomarkers. This paper applies the object-oriented framework to analyze physical activity levels, measured by accelerometers, as response objects in a regression model. Unlike traditional summary metrics, we utilize a recently proposed representation of physical activity data as a distributional object, providing a more nuanced and complete profile of individual energy expenditure across all ranges of monitoring intensity. A novel hybrid Fréchet regression model is proposed and applied to US population accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014. The semi-parametric nature of the model allows for the inclusion of nonlinear effects for critical variables, such as age, which are biologically known to have subtle impacts on physical activity. Simultaneously, the inclusion of linear effects preserves interpretability for other variables, particularly categorical covariates such as ethnicity and sex. The results obtained are valuable from a public health perspective and could lead to new strategies for optimizing physical activity interventions in specific American subpopulations.
Submitted 9 March, 2025; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Functional proportional hazards mixture cure model and its application to modelling the association between cancer mortality and physical activity in NHANES 2003-2006
Authors:
Rahul Ghosal,
Marcos Matabuena,
Jiajia Zhang
Abstract:
We develop a functional proportional hazards mixture cure (FPHMC) model with scalar and functional covariates measured at the baseline. The mixture cure model, useful in studying populations with a cure fraction of a particular event of interest, is extended to functional data. We employ the EM algorithm and develop a semiparametric penalized spline-based approach to estimate the dynamic functional coefficients of the incidence and the latency part. The proposed method is computationally efficient and simultaneously incorporates smoothness in the estimated functional coefficients via a roughness penalty. Simulation studies illustrate a satisfactory performance of the proposed method in accurately estimating the model parameters and the baseline survival function. Finally, the clinical potential of the model is demonstrated in two real data examples that incorporate rich high-dimensional biomedical signals as functional covariates measured at the baseline and constitute novel domains to apply cure survival models in contemporary medical situations. In particular, we analyze i) minute-by-minute physical activity data from the National Health and Nutrition Examination Survey (NHANES) 2003-2006 to study the association between diurnal patterns of physical activity (PA) at baseline and all cancer mortality through 2019 while adjusting for other biological factors; and ii) the impact of daily functional measures of disease severity collected in the intensive care unit on post-ICU recovery and mortality events. Our findings provide novel epidemiological insights into the association between daily patterns of PA and cancer mortality. A software implementation and an illustration of the proposed estimation method are provided in R.
Submitted 30 March, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Kernel Biclustering algorithm in Hilbert Spaces
Authors:
Marcos Matabuena,
J. C Vidal,
Oscar Hernan Madrid Padilla,
Dino Sejdinovic
Abstract:
Biclustering algorithms partition data and covariates simultaneously, providing new insights in several domains, such as analyzing gene expression to discover new biological functions. This paper develops a new model-free biclustering algorithm in abstract spaces using the notions of energy distance (ED) and the maximum mean discrepancy (MMD) -- two distances between probability distributions capable of handling complex data such as curves or graphs. The proposed method can learn more general and complex cluster shapes than most existing approaches in the literature, which usually focus on detecting mean and variance differences. Although the biclustering configurations of our approach are constrained to create disjoint structures at the datum and covariate levels, the results are competitive. Assuming a proper kernel choice, our results are similar to those of state-of-the-art methods in their optimal scenarios and outperform them when cluster differences are concentrated in higher-order moments. The model's performance has been tested in several situations that involve simulated and real-world datasets. Finally, new theoretical consistency results are established using tools from the theory of optimal transport.
Submitted 7 August, 2022;
originally announced August 2022.
-
Neural interval-censored survival regression with feature selection
Authors:
Carlos García Meixide,
Marcos Matabuena,
Louis Abraham,
Michael R. Kosorok
Abstract:
Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances in sparse neural network architectures, and ii) a regression model targeting prediction of the interval-censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our method outperforms traditional AFT algorithms, particularly in scenarios featuring non-linear relationships.
Submitted 22 August, 2024; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Hypothesis testing for matched pairs with missing data by maximum mean discrepancy: An application to continuous glucose monitoring
Authors:
Marcos Matabuena,
Paulo Félix,
Marc Ditzhaus,
Juan Vidal,
Francisco Gude
Abstract:
A frequent problem in statistical science is how to properly handle missing data in matched paired observations. There is a large body of literature coping with the univariate case. Yet, the ongoing technological progress in measuring biological systems raises the need for addressing more complex data, e.g., graphs, strings and probability distributions, among others. In order to fill this gap, this paper proposes new estimators of the maximum mean discrepancy (MMD) to handle complex matched pairs with missing data. These estimators can detect differences in data distributions under different missingness mechanisms. The validity of this approach is proven and further studied in an extensive simulation study, and results of statistical consistency are provided. Data from continuous glucose monitoring in a longitudinal population-based diabetes study are used to illustrate the application of this approach. By employing the new distributional representations together with cluster analysis, new clinical criteria on how glucose changes vary at the distributional level over five years can be explored.
Submitted 3 June, 2022;
originally announced June 2022.
-
Distributional data analysis of accelerometer data from the NHANES database using nonparametric survey regression models
Authors:
Marcos Matabuena,
Alexander Petersen
Abstract:
Accelerometers enable an objective measurement of physical activity levels among groups of individuals in free-living environments, providing high-resolution detail about physical activity changes at different time scales. Current approaches used in the literature for analyzing such data typically employ summary measures such as total inactivity time or compositional metrics. However, at the conceptual level, these methods have the potential disadvantage of discarding important information from the recorded data when calculating these summaries and metrics, since these typically depend on cut-offs related to exercise intensity zones chosen subjectively or even arbitrarily. Furthermore, much of the data collected in these studies follow complex survey designs, so estimation strategies adapted to the particular sampling mechanism are required. The aim of this paper is two-fold. First, a new functional representation of accelerometer data, distributional in nature, is introduced to build a complete individualized profile of each subject's physical activity levels. Second, we extend two nonparametric functional regression models, kernel smoothing and kernel ridge regression, to handle survey data and obtain reliable conclusions about the influence of physical activity in the different analyses performed on the NHANES cohort, which follows a complex sampling design, thereby showing the advantages of the new representation.
Submitted 20 January, 2022; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Are Multilevel functional models the next step in sports biomechanics and wearable technology? A case study of Knee Biomechanics patterns in typical training sessions of recreational runners
Authors:
Marcos Matabuena,
Sherveen Riazati,
Nick Caplan,
Phil Hayes
Abstract:
This paper illustrates how multilevel functional models can detect and characterize biomechanical changes across different sport training sessions. Our analysis focuses on identifying differences in knee biomechanics in recreational runners during low- and high-intensity exercise sessions with the same energy expenditure by recording $20$ steps. To do so, we review the existing literature on multilevel models, and then we propose a new hypothesis test to examine the changes between different levels of the multilevel model, such as low- and high-intensity training sessions. We also evaluate the reliability of the measures recorded for three-dimensional knee angles using the functional intra-class correlation coefficient (ICC) obtained from the decomposition performed with the multilevel functional model, taking into account the $20$ measures recorded in each test. The results show that there are no statistically significant differences between the two modes of exercise. However, we have to be careful with the conclusions since, as we have shown, human gait patterns are very individual and heterogeneous between groups of athletes, and other alternatives to the p-value may be more appropriate to detect statistical differences in biomechanical changes in this context.
Submitted 5 April, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Glucose values prediction five years ahead with a new framework of missing responses in reproducing kernel Hilbert spaces, and the use of continuous glucose monitoring technology
Authors:
Marcos Matabuena,
Paulo Félix,
Carlos Meijide-Garcia,
Francisco Gude
Abstract:
The AEGIS study possesses unique information on longitudinal changes in circulating glucose through continuous glucose monitoring (CGM) technology. However, as usual in longitudinal medical studies, there is a significant amount of missing data in the outcome variables. For example, 40 percent of glycosylated hemoglobin (A1C) biomarker data are missing five years ahead. To reduce the impact of this issue, this article proposes a new data analysis framework based on learning in reproducing kernel Hilbert spaces (RKHS) with missing responses, which makes it possible to capture non-linear relations between the variables studied in different supervised modeling tasks. First, we extend the Hilbert-Schmidt dependence measure to test statistical independence in this context, introducing a new bootstrap procedure for which we prove consistency. Next, we adapt or use existing models of variable selection, regression, and conformal inference to obtain new clinical findings about glucose changes five years ahead with the AEGIS data. The most relevant findings are summarized below: i) we identify new factors associated with long-term glucose evolution; ii) we show the clinical sensitivity of CGM data to detect changes in glucose metabolism; iii) clinical interventions can be improved based on our algorithms' expected glucose changes according to patients' baseline characteristics.
Submitted 14 December, 2020; v1 submitted 11 December, 2020;
originally announced December 2020.
-
Glucodensities: a new representation of glucose profiles using distributional data analysis
Authors:
Marcos Matabuena,
Alexander Petersen,
Juan C. Vidal,
Francisco Gude
Abstract:
Biosensor data has the potential to improve disease control and detection. However, the analysis of these data under free-living conditions is not feasible with current statistical techniques. To address this challenge, we introduce a new functional representation of biosensor data, termed the glucodensity, together with a data analysis framework based on distances between them. The new data analysis procedure is illustrated through an application in diabetes with continuous-time glucose monitoring (CGM) data. In this domain, we show marked improvement with respect to state-of-the-art analysis methods. In particular, our findings demonstrate that i) the glucodensity possesses an extraordinary clinical sensitivity to capture the typical biomarkers used in standard clinical practice in diabetes, ii) previous biomarkers cannot accurately predict the glucodensity, so that the latter is a richer source of information, and iii) the glucodensity is a natural generalization of the time in range metric, this being the gold standard in the handling of CGM data. Furthermore, the new method overcomes many of the drawbacks of time in range metrics, and provides deeper insight into assessing glucose metabolism.
Submitted 18 August, 2020;
originally announced August 2020.
-
COVID-19: Estimating spread in Spain solving an inverse problem with a probabilistic model
Authors:
Marcos Matabuena,
Carlos Meijide-García,
Pablo Rodríguez-Mier,
Víctor Leborán
Abstract:
We introduce a new probabilistic model to estimate the real spread of the novel SARS-CoV-2 virus across regions or countries. Our model simulates the behavior of each individual in a population according to a probabilistic model through an inverse problem; we estimate the real number of recovered and infected people using mortality records. In addition, the model is dynamic in the sense that it takes into account the policy measures introduced when we solve the inverse problem. The results obtained in Spain have particular practical relevance: in the worst-case scenario, the number of infected individuals can be $17$ times higher than the data provided by the Spanish government on April $26$th. Assuming that the number of fatalities reflected in the statistics is correct, $9.8\%$ of the population may be infected or have already recovered from the virus in Madrid, one of the most affected regions in Spain. However, if we assume that the number of fatalities is twice as high as the official numbers, the proportion of infections could have reached $19.5\%$. In Galicia, one of the least affected regions, the proportion of infections does not reach $2.5\%$. Based on our findings, we can: i) estimate the risk of a new outbreak before Autumn if we lift the quarantine; ii) estimate the degree of immunization of the population in each region; and iii) forecast or simulate the effect of the policies to be introduced in the future based on the number of infected or recovered individuals in the population.
Submitted 3 May, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Energy distance and kernel mean embeddings for two-sample survival testing
Authors:
Marcos Matabuena,
Oscar Hernan Madrid Padilla
Abstract:
We study the comparison problem of distribution equality between two random samples under a right censoring scheme. To address this problem, we design a series of tests based on energy distance and kernel mean embeddings. We calibrate our tests using permutation methods and prove that they are consistent against all fixed continuous alternatives. To evaluate our proposed tests, we simulate survival curves from previous clinical trials. Additionally, we provide practitioners with a set of recommendations on how to select parameters/distances for the delay effect problem. Based on the method for parameter tuning that we propose, we show that our tests demonstrate a considerable gain of statistical power against classical survival tests.
Submitted 9 December, 2019;
originally announced December 2019.
-
Energy distance and kernel mean embedding for two sample survival test
Authors:
Marcos Matabuena
Abstract:
In this article, a new family of tests is proposed for the problem of comparing the equality of distribution of two samples under a right censoring scheme. The tests are based on energy distance and kernel mean embeddings, are calibrated by permutations, and are consistent against all alternatives. The good performance of the new tests in real situations with finite samples is established with a simulation study.
Submitted 3 January, 2019;
originally announced January 2019.