-
Continuous-time multivariate analysis
Authors:
Biplab Paul,
Philip T. Reiss,
Erjia Cui,
Noemi Foà
Abstract:
The starting point for much of multivariate analysis (MVA) is an $n\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. Here we introduce a framework for extending techniques of multivariate analysis to such settings. The proposed continuous-time multivariate analysis (CTMVA) framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as $B$-splines, as in the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We present continuous-time extensions of the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering. We show that CTMVA can improve on the performance of classical MVA, in particular for correlation estimation and clustering, and can be applied in some settings where classical MVA cannot, including variables observed at disparate time points. CTMVA is illustrated with a novel perspective on a well-known Canadian weather data set, and with applications to data sets involving international development, brain signals, and air quality. The proposed methods are implemented in the publicly available R package \texttt{ctmva}.
Submitted 12 June, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
Bayesian mixture models for phylogenetic source attribution from consensus sequences and time since infection estimates
Authors:
Alexandra Blenkinsop,
Lysandros Sofocleous,
Francesco Di Lauro,
Evangelia Georgia Kostaki,
Ard van Sighem,
Daniela Bezemer,
Thijs van de Laar,
Peter Reiss,
Godelieve de Bree,
Nikos Pantazis,
Oliver Ratmann
Abstract:
In stopping the spread of infectious diseases, pathogen genomic data can be used to reconstruct transmission events and characterize population-level sources of infection. Most approaches for identifying transmission pairs do not account for the time passing since divergence of pathogen variants in individuals, which is problematic in viruses with high within-host evolutionary rates. This prompted us to consider possible transmission pairs in terms of phylogenetic data and additional estimates of time since infection derived from clinical biomarkers. We develop Bayesian mixture models with an evolutionary clock as signal component and additional mixed effects or covariate random functions describing the mixing weights to classify potential pairs into likely and unlikely transmission pairs. We demonstrate that although sources cannot be identified at the individual level with certainty, even with the additional data on time elapsed, inferences into the population-level sources of transmission are possible, and more accurate than using only phylogenetic data without time since infection estimates. We apply the approach to estimate age-specific sources of HIV infection in Amsterdam MSM transmission networks between 2010-2021. This study demonstrates that infection time estimates provide informative data to characterize transmission sources, and shows how phylogenetic source attribution can then be done with multi-dimensional mixture models.
Submitted 22 August, 2024; v1 submitted 13 April, 2023;
originally announced April 2023.
-
Estimating the potential to prevent locally acquired HIV infections in a UNAIDS Fast-Track City, Amsterdam
Authors:
Alexandra Blenkinsop,
Mélodie Monod,
Ard van Sighem,
Nikos Pantazis,
Daniela Bezemer,
Eline Op de Coul,
Thijs van de Laar,
Christophe Fraser,
Maria Prins,
Peter Reiss,
Godelieve de Bree,
Oliver Ratmann
Abstract:
Amsterdam and other UNAIDS Fast-Track cities aim for zero new HIV infections. Utilising molecular and clinical data of the ATHENA observational HIV cohort, our primary aims are to estimate the proportion of undiagnosed HIV infections and the proportion of locally acquired infections in Amsterdam in 2014-2018, both in MSM and heterosexuals and Dutch-born and foreign-born individuals.
We located diagnosed HIV infections in Amsterdam using postcode data at time of registration to the cohort, and estimated their date of infection using clinical HIV data. We then inferred the proportion undiagnosed from the estimated times to diagnosis. To determine sources of Amsterdam infections, we used HIV sequences of people living with HIV (PLHIV) within a background of other Dutch and international sequences to phylogenetically reconstruct transmission chains. Frequent late diagnoses indicate that more recent phylogenetically observed chains are increasingly incomplete, and we use a Bayesian model to estimate the actual growth of Amsterdam transmission chains, and the proportion of locally acquired infections.
We estimate that 20% [95% CrI 18-22%] of infections acquired among MSM between 2014-2018 were undiagnosed by the start of 2019, and 44% [37-50%] among heterosexuals, with variation by place of birth. The estimated proportion of MSM infections in 2014-2018 that were locally acquired was 68% [61-74%], with no substantial differences by region of birth. In heterosexuals, this was 57% [41-71%] overall, with heterogeneity by place of birth.
The data indicate substantial potential to further curb local transmission, in both MSM and heterosexual Amsterdam residents. In 2014-2018 the largest proportion of local transmissions in Amsterdam are estimated to have occurred in foreign-born MSM, who would likely benefit most from intensified interventions.
Submitted 15 March, 2022;
originally announced March 2022.
-
Generalized reliability based on distances
Authors:
Meng Xu,
Philip T. Reiss,
Ivor Cribben
Abstract:
The intraclass correlation coefficient (ICC) is a classical index of measurement reliability. With the advent of new and complex types of data for which the ICC is not defined, there is a need for new ways to assess reliability. To meet this need, we propose a new distance-based intraclass correlation coefficient (dbICC), defined in terms of arbitrary distances among observations. We introduce a bias correction to improve the coverage of bootstrap confidence intervals for the dbICC, and demonstrate its efficacy via simulation. We illustrate the proposed method by analyzing the test-retest reliability of brain connectivity matrices derived from a set of repeated functional magnetic resonance imaging scans. The Spearman-Brown formula, which shows how more intensive measurement increases reliability, is extended to encompass the dbICC.
Submitted 17 January, 2020; v1 submitted 15 December, 2019;
originally announced December 2019.
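For reference, the classical Spearman-Brown formula mentioned in the abstract above can be written as follows, where $\rho$ denotes the reliability of a single measurement and $\rho_k$ the reliability of the average of $k$ such measurements (symbols chosen here for illustration). This is the standard ICC version only; the dbICC extension derived in the paper is not reproduced here.

$$\rho_k = \frac{k\rho}{1 + (k-1)\rho}$$

For example, with $\rho = 0.5$ and $k = 3$, the formula gives $\rho_3 = 1.5 / 2 = 0.75$.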
-
Distribution-Free Pointwise Adjusted P-Values for Functional Hypotheses
Authors:
Meng Xu,
Philip T. Reiss
Abstract:
Graphical tests assess whether a function of interest departs from an envelope of functions generated under a simulated null distribution. This approach originated in spatial statistics, but has recently gained some popularity in functional data analysis. Whereas such envelope tests examine deviation from a functional null distribution in an omnibus sense, in some applications we wish to do more: to obtain p-values at each point in the function domain, adjusted to control the familywise error rate. Here we derive pointwise adjusted p-values based on envelope tests, and relate these to previous approaches for functional data under distributional assumptions. We then present two alternative distribution-free p-value adjustments that offer greater power. The methods are illustrated with an analysis of age-varying sex effects on cortical thickness in the human brain.
Submitted 1 December, 2019;
originally announced December 2019.
-
Wavelet-domain regression and predictive inference in psychiatric neuroimaging
Authors:
Philip T. Reiss,
Lan Huo,
Yihong Zhao,
Clare Kelly,
R. Todd Ogden
Abstract:
An increasingly important goal of psychiatry is the use of brain imaging data to develop predictive models. Here we present two contributions to statistical methodology for this purpose. First, we propose and compare a set of wavelet-domain procedures for fitting generalized linear models with scalar responses and image predictors: sparse variants of principal component regression and of partial least squares, and the elastic net. Second, we consider assessing the contribution of image predictors over and above available scalar predictors, in particular, via permutation tests and an extension of the idea of confounding to the case of functional or image predictors. Using the proposed methods, we assess whether maps of a spontaneous brain activity measure, derived from functional magnetic resonance imaging, can meaningfully predict presence or absence of attention deficit/hyperactivity disorder (ADHD). Our results shed light on the role of confounding in the surprising outcome of the recent ADHD-200 Global Competition, which challenged researchers to develop algorithms for automated image-based diagnosis of the disorder.
Submitted 16 September, 2015;
originally announced September 2015.
-
Varying-smoother models for functional responses
Authors:
Philip T. Reiss,
Lei Huang,
Huaihou Chen,
Stan Colcombe
Abstract:
This paper studies estimation of a smooth function $f(t,s)$ when we are given functional responses of the form $f(t,\cdot)$ + error, but scientific interest centers on the collection of functions $f(\cdot,s)$ for different $s$. The motivation comes from studies of human brain development, in which $t$ denotes age whereas $s$ refers to brain locations. Analogously to varying-coefficient models, in which the mean response is linear in $t$, the "varying-smoother" models that we consider exhibit nonlinear dependence on $t$ that varies smoothly with $s$. We discuss three approaches to estimating varying-smoother models: (a) methods that employ a tensor product penalty; (b) an approach based on smoothed functional principal component scores; and (c) two-step methods consisting of an initial smooth with respect to $t$ at each $s$, followed by a postprocessing step. For the first approach, we derive an exact expression for a penalty proposed by Wood, and an adaptive penalty that allows smoothness to vary more flexibly with $s$. We also develop "pointwise degrees of freedom," a new tool for studying the complexity of estimates of $f(\cdot,s)$ at each $s$. The three approaches to varying-smoother models are compared in simulations and with a diffusion tensor imaging data set.
Submitted 1 December, 2014;
originally announced December 2014.
-
Optimal Kernel Combination for Test of Independence against Local Alternatives
Authors:
Wen-Yu Hua,
Philip Reiss,
Debashis Ghosh
Abstract:
Testing the independence between two random variables $x$ and $y$ is an important problem in statistics and machine learning, and kernel-based tests of independence have recently become a focus of research on dependence. The advantage of the kernel framework rests on its flexibility in the choice of kernel. The Hilbert-Schmidt Independence Criterion (HSIC) was shown to be equivalent to a class of tests, where the tests are based on different distance-induced kernel pairs. In this work, we propose to select the optimal kernel pair by considering local alternatives, and evaluate the efficiency using the quadratic-time estimator of HSIC. The local alternative offers the advantage that the measure of efficiency does not depend on a particular alternative, and only requires knowledge of the asymptotic null distribution of the test. We show in our experiments that the proposed strategy results in higher power than other existing kernel selection approaches.
Submitted 12 April, 2015; v1 submitted 11 September, 2014;
originally announced September 2014.
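As an illustration of the quadratic-time HSIC estimator mentioned in the abstract above, the following is a minimal sketch of the standard biased (V-statistic) estimate, $\mathrm{tr}(KHLH)/n^2$ with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. The Gaussian kernel, the bandwidth parameters, and all function names are assumptions made for this example; the paper's distance-induced kernel pairs and its kernel-selection procedure are not implemented here.

```python
import math


def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples (assumed kernel choice)."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
        for a in values
    ]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic_biased(x, y, bandwidth_x=1.0, bandwidth_y=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2, computed in O(n^2) time."""
    n = len(x)
    k_centered = center(gaussian_gram(x, bandwidth_x))
    l_gram = gaussian_gram(y, bandwidth_y)
    # trace(H K H L) equals the elementwise sum because L is symmetric.
    total = sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n))
    return total / n ** 2


x = [0.0, 0.5, 1.0, 1.5, 2.0]
hsic_dependent = hsic_biased(x, x)            # y is a copy of x
hsic_constant = hsic_biased(x, [0.0] * len(x))  # y carries no information
```

In this sketch a constant `y` yields an estimate of zero (up to rounding), while `y = x` yields a strictly positive value; an actual test would compare the statistic with its null distribution, which is outside the scope of this example.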