-
Principal Subsimplex Analysis
Authors:
Hyeon Lee,
Kassel Liam Hingee,
Janice L. Scealy,
Andrew T. A. Wood,
Eric Grunsky,
J. S. Marron
Abstract:
Compositional data, also referred to as simplicial data, naturally arise in many scientific domains such as geochemistry, microbiology, and economics. In such domains, obtaining sensible lower-dimensional representations and modes of variation plays an important role. A typical approach to the problem is applying a log-ratio transformation followed by principal component analysis (PCA). However, this approach has several well-known weaknesses: it amplifies variation in minor variables; it can obscure important variation within major elements; it is not directly applicable to data sets containing zeros, and zero imputation methods give highly variable results; and it has limited ability to capture linear patterns present in compositional data. In this paper, we propose novel methods that produce nested sequences of simplices of decreasing dimensions analogous to backwards principal component analysis. These nested sequences offer both interpretable lower-dimensional representations and linear modes of variation. In addition, our methods are applicable to data sets containing zeros without any modification. We demonstrate our methods on simulated data and on relative abundances of diatom species during the late Pliocene. Supplementary materials and R implementations for this article are available online.
Submitted 13 April, 2025;
originally announced April 2025.
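The log-ratio step of the standard approach mentioned in the abstract above, and the reason zeros are a problem for it, can be illustrated with a minimal Python sketch. It uses the centred log-ratio (clr) transform as one common choice of log-ratio transformation; the helper name `clr` and the example compositions are illustrative assumptions and are not taken from the paper or its R implementations.

```python
import math


def clr(composition):
    """Centred log-ratio transform of one composition (hypothetical helper).

    Each component is replaced by its log minus the mean of all the logs.
    """
    if any(part < 0 for part in composition):
        raise ValueError("components must be non-negative")
    if any(part == 0 for part in composition):
        # log(0) is undefined, so the transform cannot be applied directly.
        raise ValueError("clr is undefined when a component is zero")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Illustrative data: a composition whose third component is minor (1%).
z = clr([0.70, 0.29, 0.01])

# The same composition with the minor component recorded as exactly zero.
try:
    clr([0.70, 0.30, 0.00])
    zero_ok = True
except ValueError:
    zero_ok = False
```

In the standard pipeline, PCA would then be run on the clr-transformed rows. In this toy example the coordinate for the 1% component has the largest magnitude, which hints at how minor components can come to dominate the analysis.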
-
Regularization and Selection in A Directed Network Model with Nodal Homophily and Nodal Effects
Authors:
Zhaoyu Xing,
Y. X. Rachel Wang,
Andrew T. A. Wood,
Tao Zou
Abstract:
This article introduces regularization and selection methods for directed networks with nodal homophily and nodal effects. The proposed approach not only preserves the statistical efficiency of the resulting estimator, but also ensures that the selection of nodal homophily and nodal effects is scalable with large-scale network data and multiple nodal features. In particular, we propose a directed random network model with nodal homophily and nodal effects, which includes the nodal features in the probability density of random networks. Subsequently, we propose a regularized maximum likelihood estimator with an adaptive LASSO-type regularizer. We demonstrate that the regularized estimator is consistent and possesses the oracle property. In addition, we propose a network Bayesian information criterion which ensures selection consistency while tuning the model. Simulation experiments are conducted to demonstrate its excellent numerical performance. An online friendship network among musicians with nodal musical preferences is used to illustrate the usefulness of the proposed network model in network-related empirical analysis.
Submitted 6 April, 2025;
originally announced April 2025.
-
A Robust Extrinsic Single-index Model for Spherical Data
Authors:
Houren Hong,
Janice L. Scealy,
Andrew T. A. Wood,
Yanrong Yang
Abstract:
Regression with a spherical response is challenging due to the absence of linear structure, making standard regression models inadequate. Existing methods, mainly parametric, lack the flexibility to capture the complex relationship induced by spherical curvature, while methods based on techniques from Riemannian geometry often suffer from computational difficulties. The non-Euclidean structure further complicates robust estimation, with very limited work addressing this issue, despite the common presence of outliers in directional data. This article introduces a new semi-parametric approach, the extrinsic single-index model (ESIM) and its robust estimation, to address these limitations. We establish large-sample properties of the proposed estimator with a wide range of loss functions and assess their robustness using the influence function and standardized influence function. Specifically, we focus on the robustness of the exponential squared loss (ESL), demonstrating comparable efficiency and superior robustness over least squares loss under high concentration. We also examine how the tuning parameter for the ESL balances efficiency and robustness, providing guidance on its optimal choice. The computational efficiency and robustness of our methods are further illustrated via simulations and applications to geochemical compositional data.
Submitted 31 March, 2025;
originally announced March 2025.
-
Change Point Detection for Random Objects with Possibly Periodic Behavior
Authors:
Jiazhen Xu,
Andrew T. A. Wood,
Tao Zou
Abstract:
Time-varying random objects have been increasingly encountered in modern data analysis. Moreover, in a substantial number of these applications, periodic behavior of the random objects has been observed. We introduce a new, powerful scan statistic and corresponding test for the precise identification and localization of abrupt changes in the distribution of non-Euclidean random objects with possibly periodic behavior. Our approach is nonparametric and effectively captures the entire distribution of these random objects. Remarkably, it operates with minimal tuning parameters, requiring only the specification of cut-off intervals near endpoints, where change points are assumed not to occur. Our theoretical contributions include deriving the asymptotic distribution of the test statistic under the null hypothesis of no change points, establishing the consistency of the test in the presence of change points under contiguous alternatives and providing rigorous guarantees on the near-optimal consistency in estimating the number and locations of change points, whether dealing with a single change point or multiple ones. We demonstrate that the most competitive method currently in the literature for change point detection in random objects is degraded by periodic behavior, as periodicity leads to blurring of the changes that this procedure aims to discover. Through comprehensive simulation studies, we demonstrate the superior power and accuracy of our approach in both detecting change points and pinpointing their locations, across scenarios involving both periodic and nonperiodic random objects. Our main application is to weighted networks, represented through graph Laplacians. The proposed method delivers highly interpretable results, as evidenced by the identification of meaningful change points in the New York City Citi Bike sharing system that align with significant historical events.
Submitted 3 January, 2025;
originally announced January 2025.
-
Empirical likelihood for Fréchet means on open books
Authors:
Karthik Bharath,
Huiling Le,
Andrew T A Wood,
Xi Yan
Abstract:
Empirical Likelihood (EL) is a type of nonparametric likelihood that is useful in many statistical inference problems, including confidence region construction and $k$-sample problems. It enjoys some remarkable theoretical properties, notably Bartlett correctability. One area where EL has potential but is under-developed is in non-Euclidean statistics where the Fréchet mean is the population characteristic of interest. Only recently has a general EL method been proposed for smooth manifolds. In this work, we continue progress in this direction and develop an EL method for the Fréchet mean on a stratified metric space that is not a manifold: the open book, obtained by gluing copies of a Euclidean space along their common boundaries. The structure of an open book captures the essential behaviour of the Fréchet mean around certain singular regions of more general stratified spaces for complex data objects, and relates intimately to the local geometry of non-binary trees in the well-studied phylogenetic treespace. We derive a version of Wilks' theorem for the EL statistic, and elucidate the delicate interplay between the asymptotic distribution and the topology of the neighbourhood around the population Fréchet mean. We then present a bootstrap calibration of the EL and prove that, under mild conditions, bootstrap-calibrated EL confidence regions have coverage error of size $O(n^{-2})$ rather than $O(n^{-1})$.
Submitted 25 December, 2024;
originally announced December 2024.
-
Robust Functional Principal Component Analysis for Non-Euclidean Random Objects
Authors:
Jiazhen Xu,
Andrew T. A. Wood,
Tao Zou
Abstract:
Functional data analysis offers a diverse toolkit of statistical methods tailored for analyzing samples of real-valued random functions. Recently, samples of time-varying random objects, such as time-varying networks, have been increasingly encountered in modern data analysis. These data structures represent elements within general metric spaces that lack local or global linear structures, rendering traditional functional data analysis methods inapplicable. Moreover, the existing methodology for time-varying random objects does not work well in the presence of outlying objects. In this paper, we propose a robust method for analysing time-varying random objects. Our method employs pointwise Fréchet medians and then constructs pointwise distance trajectories between the individual time courses and the sample Fréchet medians. This representation effectively transforms time-varying objects into functional data. A novel robust approach to functional principal component analysis based on a Winsorized U-statistic estimator of the covariance structure is introduced. The proposed robust analysis of these distance trajectories is able to identify key features of time-varying objects and is useful for downstream analysis. To illustrate the efficacy of our approach, numerical studies focusing on dynamic networks are conducted. The results indicate that the proposed method exhibits good all-round performance and surpasses the existing approach in terms of robustness when handling time-varying object data.
Submitted 6 March, 2025; v1 submitted 28 November, 2023;
originally announced December 2023.
-
A branch cut approach to the probability density and distribution functions of a linear combination of central and non-central Chi-square random variables
Authors:
Alfred Kume,
Tomonari Sei,
Andrew T. A. Wood
Abstract:
The paper considers the distribution of a general linear combination of central and non-central chi-square random variables by exploring the branch cut regions that appear in the standard Laplace inversion process. Because the original motivation comes from directional statistics, the focus of this paper is on the density function of such distributions and not on their cumulative distribution function. In fact, our results confirm that the latter is a special case of the former. Our approach provides new insight by generating alternative characterizations of the probability density function in terms of a finite number of feasible univariate integrals. In particular, the central cases seem to allow an interesting representation in terms of the branch cuts, while general degrees of freedom and non-centrality can be easily accommodated using recursive differentiation. Numerical results confirm that the proposed approach works well, while offering more transparency and therefore easier control of the accuracy.
Submitted 12 May, 2023;
originally announced May 2023.
-
Robust score matching for compositional data
Authors:
Janice L. Scealy,
Kassel L. Hingee,
John T. Kent,
Andrew T. A. Wood
Abstract:
The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well-suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation. The resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations. The resulting estimator is additionally resistant to outliers. These properties are confirmed in simulation studies, where we also demonstrate that our new outlier-robust estimator is efficient in high concentration settings, even when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.
Submitted 12 May, 2023;
originally announced May 2023.
-
Generalized Score Matching
Authors:
Jiazhen Xu,
Janice L. Scealy,
Andrew T. A. Wood,
Tao Zou
Abstract:
Score matching is an estimation procedure that has been developed for statistical models whose probability density function is known up to proportionality but whose normalizing constant is intractable, so that maximum likelihood is difficult or impossible to implement. To date, applications of score matching have focused more on continuous IID models. Motivated by various data modelling problems, this article proposes a unified asymptotic theory of generalized score matching developed under the independence assumption, covering both continuous and discrete response data, thereby giving a sound basis for score-matching-based inference. Real data analyses and simulation studies provide convincing evidence of strong practical performance of the proposed methods.
Submitted 21 April, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
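The basic idea of score matching described in the abstract above can be sketched in Python for the simplest continuous IID case. This is the classical score matching objective applied to a univariate Gaussian, chosen only because its minimiser has a closed form; it is not the generalized method of the paper, and the function names and sample values are illustrative assumptions. The normalizing constant never appears, because only the derivative of the log-density with respect to the data is used.

```python
def score_matching_objective(data, mu, sigma2):
    """Empirical score matching objective for an unnormalised Gaussian.

    With log-density -(x - mu)^2 / (2 * sigma2) up to a constant, the score
    with respect to x is -(x - mu) / sigma2 and its derivative is -1 / sigma2.
    The objective averages 0.5 * score^2 + d(score)/dx over the sample.
    """
    total = 0.0
    for x in data:
        score = -(x - mu) / sigma2
        score_derivative = -1.0 / sigma2
        total += 0.5 * score ** 2 + score_derivative
    return total / len(data)


def fit_gaussian_score_matching(data):
    """Closed-form minimiser of the objective above (hypothetical helper).

    Setting the gradient to zero gives the sample mean and the
    divide-by-n sample variance.
    """
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2


# Illustrative sample, not taken from the paper.
sample = [1.2, 0.7, 2.9, 1.8, 2.4]
mu_hat, sigma2_hat = fit_gaussian_score_matching(sample)
best = score_matching_objective(sample, mu_hat, sigma2_hat)
```

For the Gaussian the score matching estimates coincide with the maximum likelihood estimates; the method becomes useful for models where the normalizing constant, and hence the likelihood, is intractable.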
-
Cauchy robust principal component analysis with applications to high-dimensional data sets
Authors:
Ayisha Fayomi,
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood outperform or are on par with existing robust PCA techniques.
Submitted 6 November, 2022;
originally announced November 2022.
-
Generalized Score Matching for Regression
Authors:
Jiazhen Xu,
Janice L. Scealy,
Andrew T. A. Wood,
Tao Zou
Abstract:
Many probabilistic models that have an intractable normalizing constant may be extended to contain covariates. Since the evaluation of the exact likelihood is difficult or even impossible for these models, score matching was proposed to avoid explicit computation of the normalizing constant. In the literature, score matching has so far only been developed for models in which the observations are independent and identically distributed (IID). However, the IID assumption does not hold in the traditional fixed design setting for regression-type models. To deal with the estimation of these covariate-dependent models, this paper presents a new score matching approach for independent but not necessarily identically distributed data under a general framework for both continuous and discrete responses, which includes a novel generalized score matching method for count response regression. We prove that our proposed score matching estimators are consistent and asymptotically normal under mild regularity conditions. The theoretical results are supported by simulation studies and a real-data example. Additionally, our simulation results indicate that, compared to approximate maximum likelihood estimation, the generalized score matching produces estimates with substantially smaller biases in an application to doctoral publication data.
Submitted 18 March, 2022;
originally announced March 2022.
-
Score matching for compositional distributions
Authors:
Janice L. Scealy,
Andrew T. A. Wood
Abstract:
Compositional data and multivariate count data with known totals are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. It is often the case that many of the compositional components are highly right-skewed, with large numbers of zeros. A major limitation of currently available estimators for compositional models is that they either cannot handle many zeros in the data or are not computationally feasible in moderate to high dimensions. We derive a new set of score matching estimators applicable to distributions on a Riemannian manifold with boundary, of which the standard simplex is a special case. The score matching method is applied to estimate the parameters in a new flexible truncation model for compositional data and we show that the estimators are scalable and available in closed form. Through extensive simulation studies, the scoring methodology is demonstrated to work well for estimating the parameters in the new truncation model and also for the Dirichlet distribution. We apply the new model and estimators to real microbiome compositional data and show that the model provides a good fit to the data.
Submitted 22 December, 2020;
originally announced December 2020.
-
Gaussian asymptotic limits for the $α$-transformation in the analysis of compositional data
Authors:
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Compositional data consists of vectors of proportions whose components sum to 1. Such vectors lie in the standard simplex, which is a manifold with boundary. One issue that has been rather controversial within the field of compositional data analysis is the choice of metric on the simplex. One popular possibility has been to use the metric implied by log-transforming the data, as proposed by Aitchison [1, 2]; another popular approach has been to use the standard Euclidean metric inherited from the ambient space. Tsagris et al. [21] proposed a one-parameter family of power transformations, the $α$-transformations, which include both the metric implied by Aitchison's transformation and the Euclidean metric as particular cases. Our underlying philosophy is that, with many datasets, it may make sense to use the data to help us determine a suitable metric. A related possibility is to apply the $α$-transformations to a parametric family of distributions, and then estimate $α$ along with the other parameters. However, as we shall see, when one follows this last approach with the Dirichlet family, some care is needed in a certain limiting case which arises $(α\to 0)$, as we found out when fitting this model to real and simulated data. Specifically, when the maximum likelihood estimator of $α$ is close to 0, the other parameters tend to be large. The main purpose of the paper is to study this limiting case both theoretically and numerically and to provide insight into these numerical findings.
Submitted 21 February, 2019; v1 submitted 29 November, 2018;
originally announced December 2018.
-
The extended power distribution: A new distribution on $(0, 1)$
Authors:
Chibueze E. Ogbonnaya,
Simon P. Preston,
Andrew T. A. Wood
Abstract:
We propose a two-parameter bounded probability distribution called the extended power distribution. This distribution on $(0, 1)$ is similar to the beta distribution; however, it has some advantages, which we explore. We define the moments and quantiles of this distribution and show that it is possible to give an $r$-parameter extension of this distribution ($r>2$). We also consider its complementary distribution and show that it has some flexibility advantages over the Kumaraswamy and beta distributions. This distribution can be used as an alternative to the Kumaraswamy distribution since it has a closed form for its cumulative distribution function. Moreover, it can be fitted to data in which some observations are exactly equal to 1, unlike the Kumaraswamy and beta distributions, which cannot be fitted to such data or may require some censoring. The applications considered show that the extended power distribution performs favourably against the Kumaraswamy distribution in most cases.
Submitted 7 November, 2017;
originally announced November 2017.
-
Nonparametric hypothesis testing for equality of means on the simplex
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In the context of data that lie on the simplex, we investigate the use of empirical and exponential empirical likelihood, and Hotelling and James statistics, to test the null hypothesis of equal population means based on two independent samples. We perform an extensive numerical study using data simulated from various distributions on the simplex. The results, taken together with practical considerations regarding implementation, support the use of the bootstrap-calibrated James statistic.
Submitted 4 August, 2016; v1 submitted 27 July, 2016;
originally announced July 2016.
-
Improved classification for compositional data using the $α$-transformation
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the $α$-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the $α$-transformation generalises two rival approaches in compositional data analysis, one (when $α=1$) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when $α=0$) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using $α=1$ or $α=0$ gives better classification performance depends on the dataset, and moreover that using an intermediate value of $α$ can sometimes give better performance than using either 1 or 0.
Submitted 17 June, 2015; v1 submitted 16 June, 2015;
originally announced June 2015.
-
A data-based power transformation for compositional data
Authors:
Michail T. Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, namely maximization of its profile likelihood (assuming multivariate normality of the transformed data), and we exhibit the equivalence between the two versions. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we define a suitable family of metrics corresponding to the family of α-transformations and consider the corresponding family of Fréchet means.
Submitted 16 June, 2011; v1 submitted 7 June, 2011;
originally announced June 2011.
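The two versions of the transformation described in the abstract above can be sketched in Python. This is a minimal sketch under stated assumptions: the Box-Cox type version is taken to be $(D\,u - 1)/α$, where $u$ is the power-transformed composition and $D$ is the number of components, and the final multiplication by an orthonormal (Helmert-type) matrix that maps the result to $D-1$ coordinates is omitted for brevity. The function names and the example composition are hypothetical.

```python
import math


def power_transform(x, alpha):
    """Stay-in-the-simplex version: raise to the power alpha and re-close."""
    powered = [value ** alpha for value in x]
    total = sum(powered)
    return [p / total for p in powered]


def alpha_transform(x, alpha):
    """Box-Cox type version, (D * u - 1) / alpha (assumed form, see lead-in).

    For alpha == 0 the limiting value, the centred log-ratio, is returned.
    """
    d = len(x)
    if alpha == 0:
        logs = [math.log(value) for value in x]
        mean_log = sum(logs) / d
        return [value - mean_log for value in logs]
    u = power_transform(x, alpha)
    return [(d * u_i - 1.0) / alpha for u_i in u]


# Illustrative composition with D = 3 components.
x = [0.2, 0.3, 0.5]
z_raw = alpha_transform(x, 1.0)       # linear in x: the raw data approach
z_small = alpha_transform(x, 1e-6)    # close to the log-ratio limit
z_log = alpha_transform(x, 0.0)       # centred log-ratio
```

With α = 1 the result is a linear function of the raw proportions, and as α approaches 0 it approaches the centred log-ratio, which is how a single parameter spans the two approaches contrasted in the abstract.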