-
Efficient subsampling for high-dimensional data
Authors:
Vasilis Chasiotis,
Lin Wang,
Dimitris Karlis
Abstract:
In the field of big data analytics, the search for efficient subdata selection methods that enable robust statistical inferences with minimal computational resources is of high importance. A procedure prior to subdata selection could perform variable selection, as only a subset of a large number of variables is active. We propose an approach for the case where both the size of the full dataset and the number of variables are large. This approach first identifies the active variables by applying a procedure inspired by random LASSO (Least Absolute Shrinkage and Selection Operator) and then selects subdata based on leverage scores to build a predictive model. Our proposed approach outperforms approaches that already exist in the current literature, including the use of the full dataset, in both variable selection and prediction, while also exhibiting significant improvements in computing time. Simulation experiments as well as a real data application are provided.
Submitted 9 November, 2024;
originally announced November 2024.
-
A Model-Based Approach to Shot Charts Estimation in Basketball
Authors:
Luca Scrucca,
Dimitris Karlis
Abstract:
Shot charts in basketball analytics provide an indispensable tool for evaluating players' shooting performance by visually representing the distribution of field goal attempts across different court locations. However, conventional methods often overlook the bounded nature of the basketball court, leading to inaccurate representations, particularly along the boundaries and corners. In this paper, we propose a novel model-based approach to shot chart estimation and visualization that explicitly considers the physical boundaries of the basketball court. By employing Gaussian mixtures for bounded data, our methodology yields more accurate estimates of shot density distributions for both made and missed shots. Bayes' rule is then applied to derive estimates for the probability of successful shooting from any given location, and to identify the regions with the highest expected scores. To illustrate the efficacy of our proposal, we apply it to data from the 2022-23 NBA regular season, showing its usefulness through detailed analyses of shot patterns for two prominent players.
Submitted 2 May, 2024;
originally announced May 2024.
-
On the estimation of complex statistics combining different surveys
Authors:
Vasilis Chasiotis,
Dimitris Karlis
Abstract:
The importance of exploring a potential integration among surveys has been acknowledged in order to enhance effectiveness and minimize expenses. In this work, we employ the alignment method to combine information from two different surveys for the estimation of complex statistics. The derivation of the alignment weights poses challenges in the case of complex statistics due to their non-linear form. To overcome this, we propose to use a linearized variable associated with the complex statistic under consideration. Linearized variables have been widely used to derive variance estimates, thus allowing for the estimation of the variance of the combined complex statistics estimates. Simulations show the effectiveness of the proposed approach, resulting in a reduction of the variance of the combined complex statistics estimates. Also, in some cases, the use of the alignment weights derived using the linearized variable associated with a complex statistic could result in a further reduction of the variance of the combined estimates.
Submitted 8 April, 2024;
originally announced April 2024.
-
Modelling handball outcomes using univariate and bivariate approaches
Authors:
Dimitris Karlis,
Rouven Michels,
Marius Otting
Abstract:
Handball has received growing interest in recent years, including academic research on many different aspects of the sport. On the other hand, modelling the outcome of the game has attracted less interest, mainly because of the additional challenges that occur. Data analysis has revealed that the number of goals scored by each team is under-dispersed relative to a Poisson distribution and hence new models are needed for this purpose. Here we propose to circumvent the problem by modelling the score difference. This removes the need for special models since typical models for integer data like the Skellam distribution can provide sufficient fit and thus reveal some of the characteristics of the game. In the present paper we propose some models starting from a Skellam regression model and also considering zero-inflated versions as well as other discrete distributions in $\mathbb{Z}$. Furthermore, we develop some bivariate models using copulas to model the two halves of the game and thus provide insights into the game. Data from the German Bundesliga are used to show the potential of the new models.
Submitted 5 April, 2024;
originally announced April 2024.
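As a rough illustration of the score-difference idea in the abstract above, the sketch below evaluates the Skellam probability mass function, here computed as the distribution of the difference of two independent Poisson counts. The scoring rates (28.0 and 26.5 goals), the truncation limits, and the function names are assumptions made for this example only; they are not taken from the paper, and the paper's regression and zero-inflated variants are not reproduced.

```python
import math


def poisson_logpmf(k, rate):
    """Log of the Poisson probability P(X = k) for rate > 0."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def skellam_pmf(d, rate_home, rate_away, max_goals=200):
    """P(home - away = d) for independent Poisson goal counts.

    Sums over the away score j (the home score is then j + d).
    The truncation at max_goals is an assumption of this sketch.
    """
    total = 0.0
    for j in range(max(0, -d), max_goals + 1):
        total += math.exp(
            poisson_logpmf(j + d, rate_home) + poisson_logpmf(j, rate_away)
        )
    return total


# Hypothetical scoring rates for one match (illustrative values only).
rate_home, rate_away = 28.0, 26.5
support = range(-60, 61)
pmf = {d: skellam_pmf(d, rate_home, rate_away) for d in support}

p_draw = pmf[0]
p_home_win = sum(p for d, p in pmf.items() if d > 0)
mean_diff = sum(d * p for d, p in pmf.items())
var_diff = sum((d - mean_diff) ** 2 * p for d, p in pmf.items())
```

The mean of the difference is the difference of the two rates and its variance is their sum. Once used directly as a model on the integers, the Skellam distribution no longer requires the individual team scores to be Poisson, which is one way to read the abstract's remark that modelling the difference sidesteps the under-dispersion issue.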
-
Multinomial mixture for spatial data
Authors:
Anna Nalpantidi,
Dimitris Karlis,
Panagiotis Papastamoulis
Abstract:
The purpose of this paper is to extend standard finite mixture models in the context of multinomial mixtures for spatial data, in order to cluster geographical units according to demographic characteristics. The spatial information is incorporated into the model through the mixing probabilities of each component. To be more specific, a Gibbs distribution is assumed for the prior probabilities. In this way, the assignment of each observation is affected by its neighbors' clusters and spatial dependence is included in the model. Estimation is based on a modified EM algorithm which is enriched by an extra, initial step for approximating the field. The simulated field algorithm is used in this initial step. The presented model will be used for clustering municipalities of Attica with respect to the age distribution of residents.
Submitted 15 February, 2024;
originally announced February 2024.
-
Extending the Dixon and Coles model: an application to women's football data
Authors:
Rouven Michels,
Marius Ötting,
Dimitris Karlis
Abstract:
The prevalent model by Dixon and Coles (1997) extends the double Poisson model, in which two independent Poisson distributions model the number of goals scored by each team, by moving probabilities between the scores 0-0, 0-1, 1-0, and 1-1. We show that this is a special case of a multiplicative model known as the Sarmanov family. Based on this family, we create more suitable models by moving probabilities between scores and employing other discrete distributions. We apply the new models to women's football scores, which exhibit some characteristics different from those of men's football.
Submitted 5 July, 2023;
originally announced July 2023.
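To make the "moving probabilities between the scores 0-0, 0-1, 1-0, and 1-1" in the abstract above concrete, the following sketch applies the low-score adjustment factor as it is commonly stated for the Dixon and Coles (1997) model to a double Poisson baseline. The rates `lam` and `mu`, the dependence parameter `rho`, and all function names are illustrative assumptions; the Sarmanov-family extensions proposed in the paper are not shown.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability P(X = k) for rate > 0."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def dixon_coles_tau(x, y, lam, mu, rho):
    """Adjustment factor; equals 1 outside the four low scores."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def dixon_coles_pmf(x, y, lam, mu, rho):
    """Joint probability of home score x and away score y."""
    baseline = poisson_pmf(x, lam) * poisson_pmf(y, mu)
    return dixon_coles_tau(x, y, lam, mu, rho) * baseline


# Hypothetical parameter values, chosen only for illustration.
lam, mu, rho = 1.4, 1.1, -0.1
max_goals = 25  # truncation of the score grid, an assumption of this sketch

total = sum(
    dixon_coles_pmf(x, y, lam, mu, rho)
    for x in range(max_goals + 1)
    for y in range(max_goals + 1)
)
p_00_adjusted = dixon_coles_pmf(0, 0, lam, mu, rho)
p_00_independent = poisson_pmf(0, lam) * poisson_pmf(0, mu)
```

With a negative `rho`, mass moves from the 0-1 and 1-0 cells to the 0-0 and 1-1 cells. The four adjustments cancel, so the joint distribution still sums to one.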
-
Optimal subdata selection for linear model selection
Authors:
Vasilis Chasiotis,
Dimitris Karlis
Abstract:
If the assumed model does not accurately capture the underlying structure of the data, a statistical method is likely to yield sub-optimal results, and so model selection is crucial in order to conduct any statistical analysis. However, in the case of massive datasets, the selection of an appropriate model from a large pool of candidates becomes computationally challenging, and limited research has been conducted on data selection for model selection. In this study, we conduct subdata selection based on the A-optimality criterion, allowing model selection to be performed on a smaller subset of the data. We evaluate our approach based on the probability of selecting the best model and on the estimation efficiency through simulation experiments and two real data applications.
Submitted 19 June, 2023;
originally announced June 2023.
-
On the selection of optimal subdata for big data regression based on leverage scores
Authors:
Vasilis Chasiotis,
Dimitris Karlis
Abstract:
The demand for computational resources in the modeling process increases with the scale of the datasets, since traditional approaches for regression involve inverting huge data matrices. The main problem lies in the large data size, and so a standard approach is subsampling, which aims at obtaining the most informative portion of the big data. In the current paper, we explore an existing approach based on leverage scores, proposed for subdata selection in linear model discrimination. Our objective is to propose the aforementioned approach for selecting the most informative data points to estimate unknown parameters in both the first-order linear model and a model with interactions. We conclude that the approach based on leverage scores improves on existing approaches, providing simulation experiments as well as a real data application.
Submitted 5 July, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
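The role of leverage scores in the abstract above can be illustrated with a deliberately small case. For a general design matrix $X$, the leverage of row $i$ is the $i$-th diagonal entry of the hat matrix $X(X^\top X)^{-1}X^\top$; for a model with an intercept and a single covariate this reduces to $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$, which the sketch below uses. The deterministic rule of keeping the `k` rows with the largest leverage, the simulated data, and all function names are assumptions of this example and may differ from the algorithm studied in the paper.

```python
import random


def leverage_scores_simple(x):
    """Leverage scores for a regression with intercept and one covariate."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def top_leverage_indices(x, k):
    """Indices of the k rows with the largest leverage (assumed rule)."""
    scores = leverage_scores_simple(x)
    order = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def ols_simple(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Simulated full data: y = 1 + 2 x + noise (illustrative values only).
random.seed(0)
n, k = 10000, 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

idx = top_leverage_indices(x, k)
intercept, slope = ols_simple([x[i] for i in idx], [y[i] for i in idx])
```

In this one-covariate setting, high-leverage rows are simply those whose covariate value is farthest from the mean, which is why a small subsample chosen this way can still estimate the slope precisely.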
-
Subdata selection for big data regression: an improved approach
Authors:
Vasilis Chasiotis,
Dimitris Karlis
Abstract:
In the big data era, researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may suffer due to the large sample size, since they involve inverting huge data matrices, or even because the data cannot fit in memory. Proposed approaches are based on selecting representative subdata to run the regression. Existing approaches select the subdata using information criteria and/or properties from orthogonal arrays. In the present paper we improve existing algorithms by providing a new algorithm that is based on the D-optimality approach. We provide simulation evidence for its performance. Evidence about the parameters of the proposed algorithm is also provided in order to clarify the trade-offs between execution time and information gain. Real data applications are also provided.
Submitted 17 April, 2024; v1 submitted 29 April, 2023;
originally announced May 2023.
-
Piecewise survival models: a change-point analysis on herpes zoster associated pain data revisited and extended
Authors:
Dimitra Eleftheriou,
Dimitris Karlis
Abstract:
For many diseases it is reasonable to assume that the hazard rate is not constant over time but changes across different time intervals. To capture this, we work here with a piecewise survival model. One of the major problems in such piecewise models is to determine the time points of change of the hazard rate. From a practical point of view, this can provide very important information as it may reflect changes in the progress of a disease. We present piecewise Weibull regression models with covariates. The time points where change occurs are assumed unknown and need to be estimated. The equality of hazard rates across the distinct phases is also examined to verify the exact number of phases. An example based on herpes zoster data is used to demonstrate the usefulness of the developed methodology.
Submitted 7 December, 2021;
originally announced December 2021.
-
Bayesian inference for transportation origin-destination matrices: the Poisson-inverse Gaussian and other Poisson mixtures
Authors:
Konstantinos Perrakis,
Dimitris Karlis,
Mario Cools,
Davy Janssens
Abstract:
In this paper we present Poisson mixture approaches for origin-destination (OD) modeling in transportation analysis. We introduce covariate-based models which incorporate different transport modeling phases and also allow for direct probabilistic inference on link traffic based on Bayesian predictions. Emphasis is placed on the Poisson-inverse Gaussian as an alternative to the commonly-used Poisson-gamma and Poisson-lognormal models. We present a first full Bayesian formulation and demonstrate that the Poisson-inverse Gaussian is particularly suited for OD analysis due to desirable marginal and hierarchical properties. In addition, the integrated nested Laplace approximation (INLA) is considered as an alternative to Markov chain Monte Carlo and the two methodologies are compared under specific modeling assumptions. The case study is based on 2001 Belgian census data and focuses on a large, sparsely-distributed OD matrix containing trip information for 308 Flemish municipalities.
Submitted 11 November, 2020;
originally announced November 2020.
-
Infinite mixtures of multivariate normal-inverse Gaussian distributions for clustering of skewed data
Authors:
Yuan Fang,
Dimitris Karlis,
Sanjeena Subedi
Abstract:
Mixtures of multivariate normal inverse Gaussian (MNIG) distributions can be used to cluster data that exhibit features such as skewness and heavy tails. However, for cluster analysis using a traditional finite mixture model framework, the number of components either needs to be known a priori or needs to be estimated a posteriori using some model selection criterion after deriving results for a range of possible numbers of components. Moreover, different model selection criteria can sometimes result in different numbers of components, yielding uncertainty. Here, an infinite mixture model framework, also known as a Dirichlet process mixture model, is proposed for the mixtures of MNIG distributions. This Dirichlet process mixture model approach allows the number of components to grow or decay freely from 1 to $\infty$ (in practice from 1 to $N$), and the number of components is inferred along with the parameter estimates in a Bayesian framework, thus alleviating the need for model selection criteria. We provide real data applications with benchmark datasets as well as a small simulation experiment to compare with other existing models. The proposed method provides clustering results that are competitive with other clustering approaches for both simulated and real data, and parameter recovery is illustrated using simulation studies.
Submitted 11 May, 2020;
originally announced May 2020.
-
A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions
Authors:
Yuan Fang,
Dimitris Karlis,
Sanjeena Subedi
Abstract:
Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering, particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via a Gibbs sampler is used; for this, a novel approach to simulate univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. The proposed algorithm is applied to both simulated and real data. Through simulation studies and real data analysis, we show parameter recovery and that our approach provides competitive clustering results compared to other clustering approaches.
Submitted 5 May, 2020;
originally announced May 2020.
-
Clustering Discrete-Valued Time Series
Authors:
Tyler Roick,
Dimitris Karlis,
Paul D. McNicholas
Abstract:
There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques such as the selection of the number of clusters, estimation using expectation-maximization and model selection are applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications.
Submitted 27 March, 2020; v1 submitted 26 January, 2019;
originally announced January 2019.
-
An integer-valued time series model for multivariate surveillance
Authors:
Xanthi Pedeli,
Dimitris Karlis
Abstract:
Recently, different types of surveillance data have become available for public health reasons. In most cases several variables are monitored and events of different types are reported. As the amount of surveillance data increases, statistical methods that can effectively address multivariate surveillance scenarios are in demand. Even though research activity in this field has increased rapidly in recent years, only a few approaches have simultaneously addressed the integer-valued property of the data and its correlation structure (both time correlation and cross correlation). In this paper, we suggest a multivariate integer-valued autoregressive model that allows for both serial and cross correlation between the series and can easily accommodate overdispersion and covariate information. Moreover, its structure implies a natural decomposition into an endemic and an epidemic component, a common distinction in dynamic models for infectious disease counts. Detection of disease outbreaks is achieved through the comparison of surveillance data with one-step-ahead predictions obtained after fitting the suggested model to a set of clean historical data. The performance of the suggested model is illustrated on a trivariate series of syndromic surveillance data collected during the Athens 2004 Olympic Games.
Submitted 13 September, 2019; v1 submitted 22 May, 2018;
originally announced May 2018.
-
Model-based clustering using copulas with applications
Authors:
Ioannis Kosmidis,
Dimitris Karlis
Abstract:
The majority of model-based clustering techniques is based on multivariate Normal models and their variants. In this paper copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
Submitted 2 July, 2015; v1 submitted 15 April, 2014;
originally announced April 2014.