-
Fast and light-weight energy statistics using the \textit{R} package \textsf{Rfast}
Authors:
Michail Tsagris,
Manos Papadakis
Abstract:
Energy statistics, also known as $\mathcal{\varepsilon}$-statistics, are functions of distances between statistical observations. This class of functions has enabled the development of non-linear statistical concepts, such as distance variance, distance covariance, and distance correlation. However, the computational burden associated with $\mathcal{\varepsilon}$-statistics is substantial, particularly when the data reside in multivariate space. To address this challenge, we have developed a method to significantly reduce memory requirements and accelerate computations, thereby facilitating the analysis of large data sets. The following cases are demonstrated: univariate and multivariate distance variance, distance covariance, partial distance correlation, energy distance, and hypothesis testing for the equality of univariate distributions.
Submitted 21 February, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
Energy Based Equality of Distributions Testing for Compositional Data
Authors:
Volkan Sevinc,
Michail Tsagris
Abstract:
Few tests exist for the equality of two or more multivariate distributions with compositional data, perhaps due to their constrained sample space. At the moment, only one test has been suggested, and it relies upon random projections. We propose a novel test termed α-Energy Based Test (α-EBT) to compare the multivariate distributions of two (or more) compositional data sets. Similar to the aforementioned test, the new test makes no parametric assumptions about the data and, based on simulation studies, it exhibits higher power levels.
Submitted 11 December, 2024; v1 submitted 6 December, 2024;
originally announced December 2024.
-
Directional data analysis using the spherical Cauchy and the Poisson kernel-based distribution
Authors:
Michail Tsagris,
Panagiotis Papastamoulis,
Shogo Kato
Abstract:
In 2020, two novel distributions for the analysis of directional data were introduced: the spherical Cauchy distribution and the Poisson kernel-based distribution. This paper provides a detailed exploration of both distributions within various analytical frameworks. To enhance the practical utility of these distributions, alternative parametrizations that offer advantages in numerical stability and parameter estimation are presented, such as the implementation of the Newton-Raphson algorithm for parameter estimation, while facilitating a more efficient and simplified approach in the regression framework. Additionally, a two-sample location test based on the log-likelihood ratio test is introduced. This test is designed to assess whether the location parameters of two populations can be assumed equal. The maximum likelihood discriminant analysis framework is developed for classification purposes, and finally, the problem of clustering directional data is addressed by fitting finite mixtures of spherical Cauchy or Poisson kernel-based distributions. Empirical validation is conducted through comprehensive simulation studies and real data applications, wherein the performance of the spherical Cauchy and Poisson kernel-based distributions is systematically compared.
Submitted 10 November, 2024; v1 submitted 5 September, 2024;
originally announced September 2024.
-
Constrained least squares simplicial-simplicial regression
Authors:
Michail Tsagris
Abstract:
Simplicial-simplicial regression refers to the regression setting where both the responses and predictor variables lie within the simplex space, i.e. they are compositional. For this setting, constrained least squares, where the regression coefficients themselves lie within the simplex, is proposed. The model is transformation-free, but the adoption of a power transformation is straightforward; it can treat more than one compositional dataset as predictors and offers the possibility of weights among the simplicial predictors. Among the model's advantages are its ability to treat zeros in a natural way and a highly computationally efficient algorithm to estimate its coefficients. Resampling based hypothesis testing procedures are employed for inference, such as testing linear independence and the equality of the regression coefficients to some pre-specified values. The strategy behind the formulation of the new model is related to an existing methodology of the same spirit, showcasing how other similar models can be employed as well. Finally, the performance of the proposed technique is compared to that of the existing methodology using simulation studies and real data examples.
Submitted 23 December, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Circular and Spherical Projected Cauchy Distributions: A Novel Framework for Circular and Directional Data Modeling
Authors:
Michail Tsagris,
Omar Alzeley
Abstract:
We introduce a novel family of projected distributions on the circle and the sphere, namely the circular and spherical projected Cauchy distributions, as promising alternatives for modelling circular and spherical data. The circular distribution encompasses the wrapped Cauchy distribution as a special case, while featuring a more convenient parameterisation. We also propose a generalised wrapped Cauchy distribution that includes an extra parameter, enhancing the fit of the distribution. In the spherical context, we impose two conditions on the scatter matrix of the Cauchy distribution, resulting in an elliptically symmetric distribution. Our projected distributions exhibit attractive properties, such as a closed-form normalising constant and straightforward random value generation. The distribution parameters can be estimated using maximum likelihood, and we assess their bias through numerical studies. Further, we compare our proposed distributions to existing models with real datasets, demonstrating equal or superior fitting both with and without covariates.
Submitted 11 September, 2024; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Cauchy robust principal component analysis with applications to high-dimensional data sets
Authors:
Ayisha Fayomi,
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood outperform or are on par with existing robust PCA techniques.
Submitted 6 November, 2022;
originally announced November 2022.
-
Inference for Network Count Time Series with the R Package PNAR
Authors:
Mirko Armillotta,
Michail Tsagris,
Konstantinos Fokianos
Abstract:
We introduce a new R package useful for inference about network count time series. Such data are frequently encountered in statistics and they are usually treated as multivariate time series. Their statistical analysis is based on linear or log linear models. Nonlinear models, which have been applied successfully in several research areas, have been neglected in such applications mainly because of their computational complexity. We provide R users with the flexibility to fit and study nonlinear network count time series models which include either a drift in the intercept or a regime switching mechanism. We develop several computational tools including estimation of various count Network Autoregressive models and fast computational algorithms for testing linearity in standard cases and when non-identifiable parameters hamper the analysis. Finally, we introduce a copula Poisson algorithm for simulating multivariate network count time series. We illustrate the methodology by modeling the weekly number of influenza cases in Germany.
Submitted 25 October, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Modelling structural zeros in compositional data via a zero-censored multivariate normal model
Authors:
Michail Tsagris
Abstract:
We present a new model for analyzing compositional data with structural zeros. Inspired by \cite{butler2008}, who suggested a model for data containing zero values, we propose a model that treats the zero values in a different manner. Instead of projecting every zero value towards a vertex, we project them onto their corresponding edge and fit a zero-censored multivariate model.
Submitted 27 August, 2022;
originally announced August 2022.
-
The FEDHC Bayesian network learning algorithm
Authors:
Michail Tsagris
Abstract:
The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Further, the paper shows that the only implementation of MMHC in the statistical software \textit{R} is prohibitively expensive, and a new implementation is offered. Further, specifically for the case of continuous data, an outlier-robust version of FEDHC, which can be adopted by other BN learning algorithms, is proposed. The FEDHC is tested via Monte Carlo simulations that distinctly show it is computationally efficient and produces Bayesian networks of similar or higher accuracy than MMHC and PCHC. Finally, an application of FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software \textit{R}.
Submitted 12 August, 2022; v1 submitted 30 November, 2020;
originally announced December 2020.
-
Estimating NBA players salary share according to their performance on court: A machine learning approach
Authors:
Ioanna Papadaki,
Michail Tsagris
Abstract:
It is customary for researchers and practitioners to fit linear models in order to predict NBA players' salaries based on the players' performance on court. In contrast, we focus on the players' salary share (with regard to the team payroll) by first selecting the most important determinants or statistics (years of experience in the league, games played, etc.) and then utilising them to predict the player salaries by employing a non-linear Random Forest machine learning algorithm. We externally evaluate our salary predictions, thus avoiding the phenomenon of over-fitting observed in most papers. Overall, using data from three distinct periods, 2017-2019, we identify the important factors that achieve very satisfactory salary predictions and we draw useful conclusions.
Submitted 31 October, 2020; v1 submitted 29 July, 2020;
originally announced July 2020.
-
A generalised OMP algorithm for feature selection with application to gene expression data
Authors:
Michail Tsagris,
Zacharias Papadovasilakis,
Kleanthi Lakiotaki,
Ioannis Tsamardinos
Abstract:
Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of available features. In this paper, we propose gOMP, a highly-scalable generalisation of the Orthogonal Matching Pursuit feature selection algorithm in several directions: (a) different types of outcomes, such as continuous, binary, nominal, and time-to-event, (b) different types of predictive models (e.g., linear least squares, logistic regression), (c) different types of predictive features (continuous, categorical), and (d) different, statistical-based stopping criteria. We compare the proposed algorithm against LASSO, a prototypical, widely used algorithm for high-dimensional data. On dozens of simulated datasets, as well as real gene expression datasets, gOMP is on par with or outperforms LASSO for case-control binary classification, quantified outcomes (regression), and (censored) survival times (time-to-event) analysis. gOMP also has several theoretical advantages that are discussed. While gOMP is based on quite simple and basic statistical ideas, and is easy to implement and to generalize, we also show in an extensive evaluation that it is quite effective in bioinformatics analysis settings.
Submitted 1 April, 2020;
originally announced April 2020.
-
Flexible non-parametric regression models for compositional data
Authors:
Michail Tsagris,
Abdulaziz Alenazi,
Connie Stewart
Abstract:
Compositional data arise in many real-life applications and versatile methods for properly analyzing this type of data in the regression context are needed. When parametric assumptions do not hold or are difficult to verify, non-parametric regression models can provide a convenient alternative method for prediction. To this end, we consider an extension to the classical $k$--$NN$ regression, termed $α$--$k$--$NN$ regression, that yields a highly flexible non-parametric regression model for compositional data through the use of the $α$-transformation. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and they can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that $α$--$k$--$NN$ regression can lead to more accurate predictions compared to current regression models which assume a, sometimes restrictive, parametric relationship with the predictor variables. In addition, the $α$--$k$--$NN$ regression, in contrast to current regression techniques, enjoys a high computational efficiency rendering it highly attractive for use with large scale, massive, or big data.
Submitted 6 September, 2023; v1 submitted 12 February, 2020;
originally announced February 2020.
-
Computationally efficient univariate filtering for massive data
Authors:
M. Tsagris,
A. Alenazi,
S. Fafalios
Abstract:
The vast availability of large scale, massive and big data has increased the computational cost of data analysis. One such case is the computational cost of univariate filtering, which typically involves fitting many univariate regression models and is essential for numerous variable selection algorithms to reduce the number of predictor variables. The paper shows how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends substituting the log-likelihood ratio test with the score test when coping with large scale data, massive data, big data, or even with data whose sample size is in the order of a few tens of thousands or higher.
Submitted 11 February, 2020;
originally announced February 2020.
-
Hypothesis testing for two population means: parametric or non-parametric test?
Authors:
Michail Tsagris,
Abdulaziz Alenazi,
Kleio-Maria Verrou,
Nikolaos Pandis
Abstract:
The parametric Welch $t$-test and the non-parametric Wilcoxon-Mann-Whitney test are the most commonly used tests for comparing the means of two independent samples. More recent testing approaches include the non-parametric empirical likelihood and exponential empirical likelihood. However, the applicability of these non-parametric likelihood testing procedures is limited partially because of their tendency to inflate the type I error in small sized samples. In order to circumvent the type I error problem, we propose simple calibrations using the $t$ distribution and bootstrapping. The two non-parametric likelihood testing procedures, with and without those calibrations, are then compared against the Wilcoxon-Mann-Whitney test and the Welch $t$-test. The comparisons are implemented via extensive Monte Carlo simulations on the grounds of type I error and power in small/medium sized samples generated from various non-normal populations. The simulation studies clearly demonstrate that a) the $t$ calibration improves the type I error of the empirical likelihood, b) bootstrap calibration improves the type I error of both non-parametric likelihoods, c) the Welch $t$-test with or without bootstrap calibration attains the type I error and produces similar levels of power with the former testing procedures, and d) the Wilcoxon-Mann-Whitney test produces inflated type I error while the computation of an exact p-value is not feasible in the presence of ties with discrete data. Further, an application to real gene expression data illustrates the high computational cost and thus the impracticality of the non-parametric likelihoods. Overall, the Welch $t$-test, which is highly computationally efficient and readily interpretable, is shown to be the best method when testing equality of two population means.
Submitted 4 October, 2019; v1 submitted 29 December, 2018;
originally announced December 2018.
-
Gaussian asymptotic limits for the $α$-transformation in the analysis of compositional data
Authors:
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Compositional data consists of vectors of proportions whose components sum to 1. Such vectors lie in the standard simplex, which is a manifold with boundary. One issue that has been rather controversial within the field of compositional data analysis is the choice of metric on the simplex. One popular possibility has been to use the metric implied by log-transforming the data, as proposed by Aitchison [1, 2]; and another popular approach has been to use the standard Euclidean metric inherited from the ambient space. Tsagris et al. [21] proposed a one-parameter family of power transformations, the $α$-transformations, which include both the metric implied by Aitchison's transformation and the Euclidean metric as particular cases. Our underlying philosophy is that, with many datasets, it may make sense to use the data to help us determine a suitable metric. A related possibility is to apply the $α$-transformations to a parametric family of distributions, and then estimate $α$ along with the other parameters. However, as we shall see, when one follows this last approach with the Dirichlet family, some care is needed in a certain limiting case which arises $(α\rightarrow 0)$, as we found out when fitting this model to real and simulated data. Specifically, when the maximum likelihood estimator of $α$ is close to 0, the other parameters tend to be large. The main purpose of the paper is to study this limiting case both theoretically and numerically and to provide insight into these numerical findings.
Submitted 21 February, 2019; v1 submitted 29 November, 2018;
originally announced December 2018.
-
Extremely efficient permutation and bootstrap hypothesis tests using R
Authors:
Christina Chatzipantsiou,
Marios Dimitriadis,
Manos Papadakis,
Michail Tsagris
Abstract:
Re-sampling based statistical tests are known to be computationally heavy, but reliable when only small sample sizes are available. Despite their nice theoretical properties, not much effort has been put into making them efficient. In this paper we treat the case of the Pearson correlation coefficient and the two independent samples t-test. We propose a highly computationally efficient method for calculating permutation based p-values in these two cases. The method is general and can be applied or adapted to other similar two sample mean or two mean vectors cases.
Submitted 28 June, 2018;
originally announced June 2018.
-
A folded model for compositional data analysis
Authors:
Michail Tsagris,
Connie Stewart
Abstract:
A folded type model is developed for analyzing compositional data. The proposed model involves an extension of the $α$-transformation for compositional data and provides a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm guarantees efficient parameter estimation. The model is validated through simulation studies and examples which illustrate that the proposed model performs better in terms of capturing the data structure, when compared to the popular logistic normal distribution, and can be advantageous over a similar model without folding.
Submitted 26 February, 2019; v1 submitted 20 February, 2018;
originally announced February 2018.
-
Conditional independence test for categorical data using Poisson log-linear model
Authors:
Michail Tsagris
Abstract:
We demonstrate how to test for conditional independence of two variables with categorical data using Poisson log-linear models. The size of the conditioning set of variables can vary from 0 (simple independence) up to many variables. We also provide a function in R for performing the test. Instead of calculating all possible tables with a for loop, we perform the test using the log-linear models, thus speeding up the process. Time comparison simulation studies are presented.
Submitted 7 June, 2017;
originally announced June 2017.
-
Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets
Authors:
Vincenzo Lagani,
Giorgos Athineou,
Alessio Farcomeni,
Michail Tsagris,
Ioannis Tsamardinos
Abstract:
The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian Networks. Most of the currently available feature-selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect SES subsumes and extends previous feature selection algorithms, like the max-min parents and children algorithm. SES is implemented in a homonymous function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data-analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.
Submitted 10 November, 2016;
originally announced November 2016.
-
Nonparametric hypothesis testing for equality of means on the simplex
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In the context of data that lie on the simplex, we investigate the use of empirical and exponential empirical likelihood, and Hotelling and James statistics, to test the null hypothesis of equal population means based on two independent samples. We perform an extensive numerical study using data simulated from various distributions on the simplex. The results, taken together with practical considerations regarding implementation, support the use of the bootstrap-calibrated James statistic.
Submitted 4 August, 2016; v1 submitted 27 July, 2016;
originally announced July 2016.
-
Exploring the Distribution for the Estimator of Rosenthal's 'Fail-Safe' Number of Unpublished Studies in Meta-analysis
Authors:
Konstantinos C. Fragkos,
Michail Tsagris,
Christos C. Frangos
Abstract:
The present paper discusses the statistical distribution of the estimator of Rosenthal's 'Fail-Safe' number NR, which is an estimator of the number of unpublished studies in meta-analysis. We calculate the probability distribution function of NR. This is achieved based on the Central Limit Theorem and the proposition that certain components of the estimator NR follow a half-normal distribution, derived from the standard normal distribution. Our proposed distributions are supported by simulations and investigation of convergence.
Submitted 24 November, 2015;
originally announced November 2015.
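The abstract above does not state the estimator itself. As a minimal sketch, assuming the commonly quoted form of Rosenthal's fail-safe number, NR = (sum of Z_i)^2 / z_alpha^2 - k, where Z_i are the standard normal deviates of the k studies and z_alpha is the one-tailed critical value, the quantity could be computed as follows. The function name and the example Z-scores are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist


def rosenthal_fail_safe_n(z_scores, alpha=0.05):
    """Sketch of Rosenthal's fail-safe number (assumed textbook form).

    z_scores: standard normal deviates, one per study in the meta-analysis.
    alpha:    one-tailed significance level used for the critical value.
    """
    k = len(z_scores)
    if k == 0:
        raise ValueError("at least one study is required")
    # One-tailed critical value of the standard normal, about 1.645 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    # Number of additional null-result studies needed to push the combined
    # Z statistic down to the critical value.
    return (sum(z_scores) ** 2) / (z_crit ** 2) - k


# Hypothetical Z-scores from four studies.
example_n = rosenthal_fail_safe_n([2.0, 2.5, 1.8, 3.0])
```

The returned value is not rounded, so it can be inspected directly or rounded according to whichever convention the analysis follows.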
-
A novel, divergence based, regression for compositional data
Authors:
Michail Tsagris
Abstract:
In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science amongst others. The goal of this paper is to propose a new, divergence based, regression modelling technique for compositional data. To do so, a recently proved metric which is a special case of the Jensen-Shannon divergence is employed. A strong advantage of this new regression technique is that zeros are naturally handled. An example with real data and simulation studies are presented and are both compared with the log-ratio based regression suggested by Aitchison in 1986.
Submitted 24 November, 2015;
originally announced November 2015.
-
The Assessment of Performance of Correlation Estimates in Discrete Bivariate Distributions Using Bootstrap Methodology
Authors:
Michael Tsagris,
Ioannis Elmatzoglou,
Christos C. Frangos
Abstract:
Little attention has been given to the correlation coefficient when data come from discrete or continuous non-normal populations. In this article, we consider the efficiency of two correlation coefficients which are from the same family, Pearson's and Spearman's estimators. Two discrete bivariate distributions were examined: the Poisson and the Negative Binomial. The comparison between these two estimators took place using classical and bootstrap techniques for the construction of confidence intervals. Thus, these techniques are also subject to comparison. Simulation studies were also used for the relative efficiency and bias of the two estimators. Pearson's estimator performed slightly better than Spearman's.
Submitted 5 November, 2015;
originally announced November 2015.
-
Publication Bias in Meta-Analysis: Confidence Intervals for Rosenthal's Fail-Safe Number
Authors:
Konstantinos C. Fragkos,
Michail Tsagris,
Christos C. Frangos
Abstract:
The purpose of the present paper is to assess the efficacy of confidence intervals for Rosenthal's fail-safe number. Although Rosenthal's estimator is widely used by researchers, its statistical properties are largely unexplored. First, we developed statistical theory which allowed us to produce confidence intervals for Rosenthal's fail-safe number. This was done by distinguishing whether the number of studies analysed in a meta-analysis is fixed or random. Each case produces different variance estimators. For a given number of studies and a given distribution, we provided five variance estimators. Confidence intervals are examined with a normal approximation and a nonparametric bootstrap. The accuracy of the different confidence interval estimates was then tested by simulation under different distributional assumptions. The half-normal distribution variance estimator has the best coverage probability. Finally, we provide a table of lower confidence intervals for Rosenthal's estimator.
Submitted 4 September, 2015;
originally announced September 2015.
-
Regression analysis with compositional data containing zero values
Authors:
Michail Tsagris
Abstract:
Regression analysis with compositional data containing zero values
Submitted 8 August, 2015;
originally announced August 2015.
-
The k-NN algorithm for compositional data: a revised approach with and without zero values present
Authors:
Michail Tsagris
Abstract:
In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science among others. The goal of this paper is to extend the taxicab metric and a newly suggested metric for compositional data by employing a power transformation. Both metrics are to be used in the k-nearest neighbours algorithm regardless of the presence of zeros. Examples with real data are exhibited.
Submitted 17 June, 2015;
originally announced June 2015.
-
Improved classification for compositional data using the $α$-transformation
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the $α$-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the $α$-transformation generalises two rival approaches in compositional data analysis, one (when $α=1$) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when $α=0$) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using $α=1$ or $α=0$ gives better classification performance depends on the dataset, and moreover that using an intermediate value of $α$ can sometimes give better performance than using either 1 or 0.
Submitted 17 June, 2015; v1 submitted 16 June, 2015;
originally announced June 2015.
-
A Dirichlet Regression Model for Compositional Data with Zeros
Authors:
Michail Tsagris,
Connie Stewart
Abstract:
Compositional data arise in many different fields, such as economics, archaeometry, ecology, geology and political science. Regression where the dependent variable is a composition is usually carried out via a log-ratio transformation of the composition or via the Dirichlet distribution. However, when there are zero values in the data these two approaches are not readily applicable. Suggestions for this problem exist, but most of them rely on substituting the zero values. In this paper we adjust the Dirichlet distribution when covariates are present, in order to allow for zero values to be present in the data, without modifying any values. To do so, we modify the log-likelihood of the Dirichlet distribution to account for zero values. Examples and simulation studies exhibit the performance of the zero-adjusted Dirichlet regression.
Submitted 7 June, 2017; v1 submitted 18 October, 2014;
originally announced October 2014.
-
On the folded normal distribution
Authors:
Michail Tsagris,
Christina Beneki,
Hossein Hassani
Abstract:
The characteristic function of the folded normal distribution and its moment function are derived. The entropy of the folded normal distribution and the Kullback-Leibler divergence from the normal and half-normal distributions are approximated using Taylor series. The accuracy of the results is also assessed using different criteria. The maximum likelihood estimates and confidence intervals for the parameters are obtained using asymptotic theory and the bootstrap method. The coverage of the confidence intervals is also examined.
Submitted 14 February, 2014;
originally announced February 2014.
-
A data-based power transformation for compositional data
Authors:
Michail T. Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, namely maximization of its profile likelihood (assuming multivariate normality of the transformed data), and exhibit the equivalence between the two versions. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we define a suitable family of metrics corresponding to the family of α-transformations and consider the corresponding family of Fréchet means.
Submitted 16 June, 2011; v1 submitted 7 June, 2011;
originally announced June 2011.
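The two versions of the power transformation described in this abstract can be illustrated with a short sketch. It assumes the stay-in-the-simplex version is u_i = x_i^α / Σ_j x_j^α and the Box-Cox type version is (D·u - 1)/α for a composition with D components, with the centred log-ratio as the α = 0 limit. The function names are hypothetical, and the final multiplication by an orthonormal (Helmert-type) basis matrix that the full α-transformation may use is deliberately omitted here.

```python
import math


def power_transform(x, alpha):
    """Stay-in-the-simplex power transformation: x_i^alpha / sum_j x_j^alpha."""
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    return [p / total for p in powered]


def clr(x):
    """Centred log-ratio transformation (requires strictly positive parts)."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def alpha_transform(x, alpha):
    """Box-Cox type version: (D * u - 1) / alpha, with clr as the alpha = 0 case.

    Simplified sketch: no projection onto an orthonormal basis is applied.
    """
    if alpha == 0:
        return clr(x)
    d = len(x)
    u = power_transform(x, alpha)
    return [(d * ui - 1.0) / alpha for ui in u]


composition = [0.2, 0.3, 0.5]          # hypothetical three-part composition
raw_like = alpha_transform(composition, 1.0)    # linear in the raw data
near_clr = alpha_transform(composition, 1e-6)   # close to clr(composition)
```

With α = 1 the output is a linear function of the raw proportions, and as α approaches 0 it approaches the centred log-ratio, which matches the two special cases named in the abstracts. For α > 0 the power step tolerates zero components, whereas the α = 0 branch does not.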