Showing 1–29 of 29 results for author: Matabuena, M

  1. arXiv:2505.08698  [pdf, ps, other]

    stat.ML cs.LG math.DS stat.AP stat.ME

    Continuous Temporal Learning of Probability Distributions via Neural ODEs with Applications in Continuous Glucose Monitoring Data

    Authors: Antonio Álvarez-López, Marcos Matabuena

    Abstract: Modeling the continuous-time dynamics of probability distributions from time-dependent data samples is a fundamental problem in many fields, including digital health. The aim is to analyze how the distribution of a biomarker, such as glucose, evolves over time and how these changes may reflect the progression of chronic diseases such as diabetes. In this paper, we propose a novel probabilistic m…

    Submitted 13 May, 2025; originally announced May 2025.

  2. arXiv:2505.05416  [pdf, other]

    stat.ME

    Variable Selection for Fixed and Random Effects in Multilevel Functional Mixed Effects Models

    Authors: Rahul Ghosal, Marcos Matabuena, Enakshi Saha

    Abstract: We develop a new method for simultaneously selecting fixed and random effects in a multilevel functional regression model. The proposed method is motivated by accelerometer-derived physical activity data from the 2011-12 cohort of the National Health and Nutrition Examination Survey (NHANES), where we are interested in identifying age and race-specific heterogeneity in covariate effects on the diu…

    Submitted 8 May, 2025; originally announced May 2025.

  3. arXiv:2501.06868  [pdf, other]

    stat.ML cs.LG stat.AP stat.ME

    Variable Selection Methods for Multivariate, Functional, and Complex Biomedical Data in the AI Age

    Authors: Marcos Matabuena

    Abstract: Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metric spaces based on best-subset select…

    Submitted 12 January, 2025; originally announced January 2025.

  4. arXiv:2501.00212  [pdf, other]

    stat.ME

    Denoising Data with Measurement Error Using a Reproducing Kernel-based Diffusion Model

    Authors: Mingyang Yi, Marcos Matabuena, Ruoyu Wang

    Abstract: The ongoing technological revolution in measurement systems enables the acquisition of high-resolution samples in fields such as engineering, biology, and medicine. However, these observations are often subject to errors from measurement devices. Motivated by this challenge, we propose a denoising framework that employs diffusion models to generate denoised data whose distribution closely approxim…

    Submitted 30 December, 2024; originally announced January 2025.

  5. arXiv:2410.00912  [pdf, other]

    stat.AP

    Glucodensity Functional Profiles Outperform Traditional Continuous Glucose Monitoring Metrics

    Authors: Marcos Matabuena, Rahul Ghosal, Javier Enrique Aguilar, Robert Wagner, Carmen Fernández Merino, Juan Sánchez Castro, Vadim Zipunnikov, Jukka-Pekka Onnela, Francisco Gude

    Abstract: Continuous glucose monitoring (CGM) data have revolutionized the management of type 1 diabetes, particularly when integrated with insulin pumps to mitigate clinical events such as hypoglycemia. Recently, there has been growing interest in utilizing CGM devices in clinical studies involving healthy and diabetes populations. However, efficiently exploiting the high temporal resolution of CGM profiles…

    Submitted 1 October, 2024; originally announced October 2024.

  6. arXiv:2409.02644  [pdf, ps, other]

    stat.ML cs.LG q-bio.QM

    Conformal Prediction in Dynamic Biological Systems

    Authors: Alberto Portela, Julio R. Banga, Marcos Matabuena

    Abstract: Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex bi…

    Submitted 28 October, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

  7. arXiv:2408.16381  [pdf, other]

    stat.ME math.ST

    Uncertainty quantification for intervals

    Authors: Carlos García Meixide, Michael R. Kosorok, Marcos Matabuena

    Abstract: Data following an interval structure are increasingly prevalent in many scientific applications. In medicine, clinical events are often monitored between two clinical visits, making the exact time of the event unknown and generating outcomes with a range format. As interest in automating healthcare decisions grows, uncertainty quantification via predictive regions becomes essential for developing…

    Submitted 30 March, 2025; v1 submitted 29 August, 2024; originally announced August 2024.

  8. arXiv:2406.19716  [pdf, other]

    stat.ME stat.AP

    Functional Time Transformation Model with Applications to Digital Health

    Authors: Rahul Ghosal, Marcos Matabuena, Sujit K. Ghosh

    Abstract: The advent of wearable and sensor technologies now leads to functional predictors which are intrinsically infinite dimensional. While the existing approaches for functional data and survival outcomes lean on the well-established Cox model, the proportional hazard (PH) assumption might not always be suitable in real-world applications. Motivated by physiological signals encountered in digital medic…

    Submitted 28 June, 2024; originally announced June 2024.

  9. arXiv:2405.14690  [pdf, other]

    q-bio.QM stat.AP

    Multilevel functional data analysis modeling of human glucose response to meal intake

    Authors: Marcos Matabuena, Joe Sartini, Francisco Gude

    Abstract: Glucose meal response information collected via Continuous Glucose Monitoring (CGM) is relevant to the assessment of individual metabolic status and the support of personalized diet prescriptions. However, the complexity of the data produced by CGM monitors pushes the limits of existing analytic methods. CGM data often exhibit substantial within-person variability and have a natural multilevel str…

    Submitted 23 May, 2024; originally announced May 2024.

  10. arXiv:2405.13970  [pdf, other]

    stat.ME

    Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces

    Authors: Marcos Matabuena, Rahul Ghosal, Pavlo Mozharovskyi, Oscar Hernan Madrid Padilla, Jukka-Pekka Onnela

    Abstract: Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantifi…

    Submitted 22 May, 2024; originally announced May 2024.

  11. arXiv:2405.05110  [pdf, other]

    math.ST stat.ML

    Uncertainty quantification in metric spaces

    Authors: Gábor Lugosi, Marcos Matabuena

    Abstract: This paper introduces a novel uncertainty quantification framework for regression models where the response takes values in a separable metric space, and the predictors are in a Euclidean space. The proposed algorithms can efficiently handle large datasets and are agnostic to the predictive base model used. Furthermore, the algorithms possess asymptotic consistency guarantees and, in some special…

    Submitted 8 May, 2024; originally announced May 2024.

  12. arXiv:2404.09716  [pdf, other]

    stat.ME stat.AP

    Optimal Cut-Point Estimation for Functional Digital Biomarkers: Application to Diabetes Risk Stratification via Continuous Glucose Monitoring

    Authors: Oscar Lado-Baleato, Carla Díaz-Louza, Francisco Gude, Marcos Matabuena

    Abstract: Establishing optimal cut-offs for clinical biomarkers is a fundamental statistical problem in epidemiology, clinical trials, and drug discovery. While there is extensive literature regarding the definition of optimal cut-offs for scalar biomarkers, methodologies for analyzing random statistical objects in the more complex spaces associated with random functions and graphs - something increasingly…

    Submitted 9 March, 2025; v1 submitted 15 April, 2024; originally announced April 2024.

  13. arXiv:2403.19752  [pdf, other]

    stat.ME stat.ML

    Screening for Diabetes Mellitus in the U.S. Population Using Neural Network Models and Complex Survey Designs

    Authors: Marcos Matabuena, Juan C. Vidal, Rahul Ghosal, Jukka-Pekka Onnela

    Abstract: Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential for minimizing selective biases in the statistical results. The objectives of this paper are to: (i) propose a general predictive framework for regression and classification using neur…

    Submitted 25 March, 2025; v1 submitted 28 March, 2024; originally announced March 2024.

  14. arXiv:2403.18069  [pdf, other]

    stat.ME stat.AP

    Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information

    Authors: Marcos Matabuena, Carla Díaz-Louzao, Rahul Ghosal, Francisco Gude

    Abstract: The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric spa…

    Submitted 26 March, 2024; originally announced March 2024.

  15. arXiv:2403.10514  [pdf, other]

    stat.ME stat.AP

    Multilevel functional distributional models with application to continuous glucose monitoring in diabetes clinical trials

    Authors: Marcos Matabuena, Ciprian M. Crainiceanu

    Abstract: Continuous glucose monitoring (CGM) is a minimally invasive technology that allows continuous monitoring of an individual's blood glucose. We focus on a large clinical trial that collected CGM data every few minutes for 26 weeks and assume that the basic observation unit is the distribution of CGM observations in a four-week interval. The resulting data structure is multilevel (because each indiv…

    Submitted 15 March, 2024; originally announced March 2024.

  16. arXiv:2402.01635  [pdf, other]

    stat.ME cs.LG stat.CO stat.ML

    kNN Algorithm for Conditional Mean and Variance Estimation with Automated Uncertainty Quantification and Variable Selection

    Authors: Marcos Matabuena, Juan C. Vidal, Oscar Hernan Madrid Padilla, Jukka-Pekka Onnela

    Abstract: In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach i…

    Submitted 2 February, 2024; originally announced February 2024.

  17. arXiv:2310.10494  [pdf, other]

    stat.ME stat.AP

    Multivariate Scalar on Multidimensional Distribution Regression

    Authors: Rahul Ghosal, Marcos Matabuena

    Abstract: We develop a new method for multivariate scalar on multidimensional distribution regression. Traditional approaches typically analyze isolated univariate scalar outcomes or consider unidimensional distributional representations as predictors. However, these approaches are sub-optimal because: i) they fail to utilize the dependence between the distributional predictors; ii) neglect the correlation…

    Submitted 16 October, 2023; originally announced October 2023.

  18. arXiv:2302.07692  [pdf, other]

    stat.ME stat.AP

    Predicting Distributions of Physical Activity Profiles in the NHANES Database Using a Partially Linear Fréchet Single Index Model

    Authors: Marcos Matabuena, Aritra Ghosal, Wendy Meiring, Alexander Petersen

    Abstract: Object-oriented data analysis is a fascinating and evolving field in modern statistical science, with the potential to make significant contributions to biomedical applications. This statistical framework facilitates the development of new methods to analyze complex data objects that capture more information than traditional clinical biomarkers. This paper applies the object-oriented framework to…

    Submitted 9 March, 2025; v1 submitted 15 February, 2023; originally announced February 2023.

  19. arXiv:2302.07340  [pdf, other]

    stat.ME stat.AP

    Functional proportional hazards mixture cure model and its application to modelling the association between cancer mortality and physical activity in NHANES 2003-2006

    Authors: Rahul Ghosal, Marcos Matabuena, Jiajia Zhang

    Abstract: We develop a functional proportional hazards mixture cure (FPHMC) model with scalar and functional covariates measured at the baseline. The mixture cure model, useful in studying populations with a cure fraction of a particular event of interest, is extended to functional data. We employ the EM algorithm and develop a semiparametric penalized spline-based approach to estimate the dynamic functional…

    Submitted 30 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

  20. arXiv:2208.03675  [pdf, other]

    stat.ME math.ST stat.ML

    Kernel Biclustering algorithm in Hilbert Spaces

    Authors: Marcos Matabuena, J. C Vidal, Oscar Hernan Madrid Padilla, Dino Sejdinovic

    Abstract: Biclustering algorithms partition data and covariates simultaneously, providing new insights in several domains, such as analyzing gene expression to discover new biological functions. This paper develops a new model-free biclustering algorithm in abstract spaces using the notions of energy distance (ED) and the maximum mean discrepancy (MMD) -- two distances between probability distributions capa…

    Submitted 7 August, 2022; originally announced August 2022.

  21. arXiv:2206.06885  [pdf, other]

    stat.ML cs.LG stat.ME

    Neural interval-censored survival regression with feature selection

    Authors: Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok

    Abstract: Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-e…

    Submitted 22 August, 2024; v1 submitted 14 June, 2022; originally announced June 2022.

    Journal ref: Statistical Analysis and Data Mining: The ASA Data Science Journal 17.4 (2024):

  22. arXiv:2206.01590  [pdf, other]

    stat.ME stat.ML

    Hypothesis testing for matched pairs with missing data by maximum mean discrepancy: An application to continuous glucose monitoring

    Authors: Marcos Matabuena, Paulo Félix, Marc Ditzhaus, Juan Vidal, Francisco Gude

    Abstract: A frequent problem in statistical science is how to properly handle missing data in matched paired observations. There is a large body of literature coping with the univariate case. Yet, the ongoing technological progress in measuring biological systems raises the need for addressing more complex data, e.g., graphs, strings and probability distributions, among others. In order to fill this gap, th…

    Submitted 3 June, 2022; originally announced June 2022.

  23. arXiv:2104.01165  [pdf, other]

    stat.ME stat.AP stat.OT

    Distributional data analysis of accelerometer data from the NHANES database using nonparametric survey regression models

    Authors: Marcos Matabuena, Alexander Petersen

    Abstract: Accelerometers enable an objective measurement of physical activity levels among groups of individuals in free-living environments, providing high-resolution detail about physical activity changes at different time scales. Current approaches used in the literature for analyzing such data typically employ summary measures such as total inactivity time or compositional metrics. However, at the conce…

    Submitted 20 January, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

  24. arXiv:2103.15704  [pdf, other]

    stat.AP stat.ME stat.OT

    Are Multilevel functional models the next step in sports biomechanics and wearable technology? A case study of Knee Biomechanics patterns in typical training sessions of recreational runners

    Authors: Marcos Matabuena, Sherveen Riazati, Nick Caplan, Phil Hayes

    Abstract: This paper illustrates how multilevel functional models can detect and characterize biomechanical changes across different sports training sessions. Our analysis focuses on the relevant cases to identify differences in knee biomechanics in recreational runners during low and high-intensity exercise sessions with the same energy expenditure by recording 20 steps. To do so, we review the existing li…

    Submitted 5 April, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

  25. arXiv:2012.06564  [pdf, other]

    stat.ML cs.LG q-bio.QM stat.AP stat.ME

    Glucose values prediction five years ahead with a new framework of missing responses in reproducing kernel Hilbert spaces, and the use of continuous glucose monitoring technology

    Authors: Marcos Matabuena, Paulo Félix, Carlos Meijide-Garcia, Francisco Gude

    Abstract: The AEGIS study possesses unique information on longitudinal changes in circulating glucose through continuous glucose monitoring (CGM) technology. However, as usual in longitudinal medical studies, there is a significant amount of missing data in the outcome variables. For example, 40 percent of glycosylated hemoglobin (A1C) biomarker data are missing five years ahead. With the purpose to reduce the…

    Submitted 14 December, 2020; v1 submitted 11 December, 2020; originally announced December 2020.

  26. arXiv:2008.07840  [pdf, other]

    stat.AP q-bio.QM stat.OT

    Glucodensities: a new representation of glucose profiles using distributional data analysis

    Authors: Marcos Matabuena, Alexander Petersen, Juan C. Vidal, Francisco Gude

    Abstract: Biosensor data have the potential to improve disease control and detection. However, the analysis of these data under free-living conditions is not feasible with current statistical techniques. To address this challenge, we introduce a new functional representation of biosensor data, termed the glucodensity, together with a data analysis framework based on distances between them. The new da…

    Submitted 18 August, 2020; originally announced August 2020.

  27. arXiv:2004.13695  [pdf, other]

    q-bio.PE q-bio.QM stat.AP

    COVID-19: Estimating spread in Spain solving an inverse problem with a probabilistic model

    Authors: Marcos Matabuena, Carlos Meijide-García, Pablo Rodríguez-Mier, Víctor Leborán

    Abstract: We introduce a new probabilistic model to estimate the real spread of the novel SARS-CoV-2 virus across regions or countries. Our model simulates the behavior of each individual in a population according to a probabilistic model through an inverse problem; we estimate the real number of recovered and infected people using mortality records. In addition, the model is dynamic in the sense that it tak…

    Submitted 3 May, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

    Comments: 36 pag

  28. arXiv:1912.04160  [pdf, other]

    stat.ME

    Energy distance and kernel mean embeddings for two-sample survival testing

    Authors: Marcos Matabuena, Oscar Hernan Madrid Padilla

    Abstract: We study the comparison problem of distribution equality between two random samples under a right censoring scheme. To address this problem, we design a series of tests based on energy distance and kernel mean embeddings. We calibrate our tests using permutation methods and prove that they are consistent against all fixed continuous alternatives. To evaluate our proposed tests, we simulate surviva…

    Submitted 9 December, 2019; originally announced December 2019.

  29. arXiv:1901.00833  [pdf, ps, other]

    math.ST stat.ME

    Energy distance and kernel mean embedding for two sample survival test

    Authors: Marcos Matabuena

    Abstract: In this article, a new family of tests is proposed for the problem of comparing the equality of distribution of two samples under a right censoring scheme. The tests are based on energy distance and kernel mean embeddings, are calibrated by permutations, and are consistent against all alternatives. The good performance of the new tests in real situations with finite samples is established with a simul…

    Submitted 3 January, 2019; originally announced January 2019.