-
A Monte Carlo comparison of categorical tests of independence
Authors:
Abdulaziz Alenazi
Abstract:
The $X^2$ and $G^2$ tests are the most frequently applied tests of independence between two categorical variables. However, to the best of our knowledge, no one has compared them extensively and ultimately answered the question of which to use and when. Further, their applicability in cases with zero frequencies has been debated, and (non-parametric) permutation tests have been suggested instead. In this work we perform extensive Monte Carlo simulation studies to address both points. As expected, with large sample sizes ($>1,000$) the $X^2$ and $G^2$ tests are indistinguishable. With small sample sizes ($\leq 1,000$), though, we provide strong evidence supporting the use of the $X^2$ test, regardless of zero frequencies, for the case of unconditional independence. For testing conditional independence, we suggest the permutation-based $G^2$ test, at the cost of greater computational expense. The $G^2$ test exhibited inferior performance and its use should be limited.
Submitted 2 April, 2020;
originally announced April 2020.
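To make the two statistics in the abstract above concrete, the following is a minimal sketch of the $X^2$ and $G^2$ statistics for a two-way table, together with a permutation p-value. It is an illustration only, not the authors' simulation code: the function names, the convention that zero-count cells contribute nothing to $G^2$, and the default of 999 permutations are all assumptions made here.

```python
import math
import random


def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return [[ri * cj / n for cj in col] for ri in row]


def x2_stat(table):
    """Pearson X^2 = sum (O - E)^2 / E, skipping cells whose expected count is 0."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for r_o, r_e in zip(table, exp)
               for o, e in zip(r_o, r_e) if e > 0)


def g2_stat(table):
    """Likelihood-ratio G^2 = 2 * sum O * log(O / E).

    Assumption: cells with O = 0 contribute 0 (the limit of O * log(O / E)).
    """
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for r_o, r_e in zip(table, exp)
                     for o, e in zip(r_o, r_e) if o > 0)


def crosstab(x, y, n_rows, n_cols):
    """Build a contingency table from integer-coded labels x and y."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for xi, yi in zip(x, y):
        table[xi][yi] += 1
    return table


def permutation_pvalue(x, y, n_rows, n_cols, stat=g2_stat, n_perm=999, seed=0):
    """Permutation p-value: shuffle y, recompute the statistic, count exceedances."""
    rng = random.Random(seed)
    observed = stat(crosstab(x, y, n_rows, n_cols))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if stat(crosstab(x, y_perm, n_rows, n_cols)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Passing `stat=x2_stat` gives a permutation version of the $X^2$ test. Each permutation rebuilds the full table, which illustrates the extra computational cost the abstract mentions.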
-
Flexible non-parametric regression models for compositional data
Authors:
Michail Tsagris,
Abdulaziz Alenazi,
Connie Stewart
Abstract:
Compositional data arise in many real-life applications, and versatile methods for properly analyzing this type of data in the regression context are needed. When parametric assumptions do not hold or are difficult to verify, non-parametric regression models can provide a convenient alternative for prediction. To this end, we consider an extension of the classical $k$--$NN$ regression, termed $α$--$k$--$NN$ regression, that yields a highly flexible non-parametric regression model for compositional data through the use of the $α$-transformation. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that $α$--$k$--$NN$ regression can lead to more accurate predictions than current regression models, which assume a sometimes restrictive parametric relationship with the predictor variables. In addition, the $α$--$k$--$NN$ regression, in contrast to current regression techniques, is highly computationally efficient, which makes it attractive for use with large scale, massive, or big data.
Submitted 6 September, 2023; v1 submitted 12 February, 2020;
originally announced February 2020.
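The abstract above does not spell out the estimator, so the following is only a rough sketch of one way a $k$-NN regression with a compositional response could work. It assumes that neighbours are found by Euclidean distance in the predictor space and that the prediction is a power-type mean (with exponent `alpha`) of the neighbours' compositions, re-normalised to sum to one. The function names and this particular form of the mean are assumptions for illustration, not the paper's confirmed definition.

```python
import math


def closure(v):
    """Rescale a non-negative vector so that its parts sum to 1."""
    s = sum(v)
    return [vi / s for vi in v]


def alpha_mean(comps, alpha):
    """Power-type mean of compositions (assumed form, see the lead-in).

    alpha > 0 tolerates zero parts; alpha == 0 falls back to the closed
    geometric mean and therefore needs strictly positive parts.
    """
    d = len(comps[0])
    k = len(comps)
    if alpha == 0:
        logs = [sum(math.log(c[j]) for c in comps) / k for j in range(d)]
        return closure([math.exp(v) for v in logs])
    powered = [closure([cj ** alpha for cj in c]) for c in comps]
    avg = [sum(p[j] for p in powered) / k for j in range(d)]
    return closure([a ** (1.0 / alpha) for a in avg])


def alpha_knn_predict(x_train, y_train, x_new, k=3, alpha=1.0):
    """Predict a composition at x_new from its k nearest training points."""
    dist = [(math.dist(x, x_new), i) for i, x in enumerate(x_train)]
    nearest = [i for _, i in sorted(dist)[:k]]
    return alpha_mean([y_train[i] for i in nearest], alpha)


# Usage: three training points, one with a zero part in the response.
x_train = [[0.0], [1.0], [10.0]]
y_train = [[0.5, 0.5, 0.0], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]
print(alpha_knn_predict(x_train, y_train, [0.4], k=2, alpha=1.0))
```

With `alpha=1.0` the prediction reduces to the ordinary arithmetic mean of the neighbours' compositions. In this sketch the zero part needs no special handling for any positive `alpha`, in line with the abstract's remark about zero values.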
-
Computationally efficient univariate filtering for massive data
Authors:
M. Tsagris,
A. Alenazi,
S. Fafalios
Abstract:
The vast availability of large scale, massive and big data has increased the computational cost of data analysis. One such case is the computational cost of univariate filtering, which typically involves fitting many univariate regression models and is essential for numerous variable selection algorithms to reduce the number of predictor variables. This paper shows how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the $t$-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the log-likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends replacing the log-likelihood ratio test with the score test when coping with large scale data, massive data, big data, or even with data whose sample size is in the order of a few tens of thousands or higher.
Submitted 11 February, 2020;
originally announced February 2020.
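As an illustration of why correlation-based filtering is cheap, the sketch below screens predictors for a continuous response without fitting any regression model. It relies on the standard result that, for a simple linear regression, the score (Lagrange multiplier) statistic for a zero slope equals $n r^2$, where $r$ is the Pearson correlation, and is asymptotically $\chi^2_1$ under the null. The function names, the 0.05 threshold and the assumption of non-constant columns are choices made here, not details taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def score_filter(columns, y, alpha=0.05):
    """Keep predictors whose score statistic n * r^2 is significant.

    columns: list of predictor columns, each the same length as y.
    Returns (index, statistic, p_value) for every retained predictor.
    """
    n = len(y)
    kept = []
    for j, col in enumerate(columns):
        stat = n * pearson_r(col, y) ** 2
        # Upper tail of a chi-square with 1 df: P(Z^2 > s) = erfc(sqrt(s / 2)).
        pval = math.erfc(math.sqrt(stat / 2.0))
        if pval < alpha:
            kept.append((j, stat, pval))
    return kept


# Usage: the first column tracks y exactly, the second only alternates sign.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
columns = [[2.0 * v for v in y], [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]]
print([j for j, _, _ in score_filter(columns, y)])
```

Each predictor costs a single pass over the data, whereas a likelihood ratio test generally requires fitting a model per predictor (iteratively, for most generalised linear models).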
-
Hypothesis testing for two population means: parametric or non-parametric test?
Authors:
Michail Tsagris,
Abdulaziz Alenazi,
Kleio-Maria Verrou,
Nikolaos Pandis
Abstract:
The parametric Welch $t$-test and the non-parametric Wilcoxon-Mann-Whitney test are the most commonly used tests for comparing the means of two independent samples. More recent testing approaches include the non-parametric empirical likelihood and exponential empirical likelihood. However, the applicability of these non-parametric likelihood testing procedures is limited, partly because of their tendency to inflate the type I error in small samples. To circumvent the type I error problem, we propose simple calibrations using the $t$ distribution and bootstrapping. The two non-parametric likelihood testing procedures, with and without those calibrations, are then compared against the Wilcoxon-Mann-Whitney test and the Welch $t$-test. The comparisons are carried out via extensive Monte Carlo simulations on the grounds of type I error and power in small and medium sized samples generated from various non-normal populations. The simulation studies clearly demonstrate that a) the $t$ calibration improves the type I error of the empirical likelihood, b) bootstrap calibration improves the type I error of both non-parametric likelihoods, c) the Welch $t$-test, with or without bootstrap calibration, attains the type I error and produces similar levels of power to the former testing procedures, and d) the Wilcoxon-Mann-Whitney test produces inflated type I error, while the computation of an exact p-value is not feasible in the presence of ties with discrete data. Further, an application to real gene expression data illustrates the high computational cost, and thus the impracticality, of the non-parametric likelihoods. Overall, the Welch $t$-test, which is highly computationally efficient and readily interpretable, is shown to be the best method when testing equality of two population means.
Submitted 4 October, 2019; v1 submitted 29 December, 2018;
originally announced December 2018.
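For readers unfamiliar with bootstrap calibration, here is a minimal sketch of the Welch $t$ statistic with a bootstrap p-value. The abstract does not describe the exact resampling scheme, so this uses one common choice as an assumption: both samples are shifted to a shared mean so that the null hypothesis holds, and each is then resampled with replacement. Function names and the default of 999 resamples are likewise illustrative.

```python
import math
import random


def welch_t(x, y):
    """Welch t statistic: mean difference over the unpooled standard error."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    diff = mx - my
    if se == 0.0:
        # Degenerate resample with no variation in either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def bootstrap_welch_pvalue(x, y, n_boot=999, seed=0):
    """Two-sided bootstrap p-value for equality of means (assumed scheme)."""
    rng = random.Random(seed)
    t_obs = abs(welch_t(x, y))
    mx, my = sum(x) / len(x), sum(y) / len(y)
    common = (sum(x) + sum(y)) / (len(x) + len(y))
    # Shift each sample to the common mean so the null is true by construction.
    x0 = [a - mx + common for a in x]
    y0 = [b - my + common for b in y]
    hits = 0
    for _ in range(n_boot):
        xb = rng.choices(x0, k=len(x0))
        yb = rng.choices(y0, k=len(y0))
        if abs(welch_t(xb, yb)) >= t_obs:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_boot + 1)
```

The plain Welch test needs only the two sample means and variances, while the calibrated version repeats that computation `n_boot` times. The cost of such repetition grows quickly when many tests are run, as in the gene expression application the abstract mentions.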