-
$β$-integrated local depth and corresponding partitioned local depth representation
Authors:
Siyi Wang,
Alexandre Leblanc,
Paul D. McNicholas
Abstract:
A novel local depth definition, $β$-integrated local depth ($β$-ILD), is proposed as a generalization of the local depth introduced by Paindaveine and Van Bever \cite{paindaveine2013depth}, designed to quantify the local centrality of data points. $β$-ILD inherits desirable properties from global data depth and remains robust across varying locality levels. A partitioning approach for $β$-ILD is introduced, leading to the construction of a matrix that quantifies the contribution of one point to another's local depth, providing a new interpretable measure of local centrality. These concepts are applied to classification and outlier detection tasks, demonstrating significant improvements in the performance of depth-based algorithms.
Submitted 16 June, 2025;
originally announced June 2025.
-
Assessing and Visualizing Matrix Variate Normality
Authors:
Nikola Pocuca,
Michael P. B. Gallaugher,
Katharine M. Clark,
Paul D. McNicholas
Abstract:
A framework for assessing the matrix variate normality of three-way data is developed. The framework comprises a visual method and a goodness of fit test based on the Mahalanobis squared distance (MSD). The MSDs of multivariate and matrix variate normal estimators, respectively, are used as assessment tools for matrix variate normality. Specifically, these are used in the form of a distance-distance (DD) plot as a graphical method for visualizing matrix variate normality. In addition, we employ the popular Kolmogorov-Smirnov goodness of fit test in the context of assessing matrix variate normality for three-way data. Finally, a simulation study spanning a large range of dimensions and data sizes shows that, for various settings, the test is highly robust.
Submitted 7 October, 2019;
originally announced October 2019.
-
Three Skewed Matrix Variate Distributions
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Three-way data can be conveniently modelled by using matrix variate distributions. Although there has been a lot of work for the matrix variate normal distribution, there is little work in the area of matrix skew distributions. Three matrix variate distributions that incorporate skewness, as well as other flexible properties such as concentration, are discussed. Equivalences to multivariate analogues are presented, and moment generating functions are derived. Maximum likelihood parameter estimation is discussed, and simulated data is used for illustration.
Submitted 13 August, 2018; v1 submitted 8 April, 2017;
originally announced April 2017.
-
A Matrix Variate Skew-t Distribution
Authors:
Michael P. B. Gallaugher,
Paul D. McNicholas
Abstract:
Although there is ample work in the literature dealing with skewness in the multivariate setting, there is a relative paucity of work in the matrix variate paradigm. Such work is, for example, useful for modelling three-way data. A matrix variate skew-t distribution is derived based on a mean-variance matrix normal mixture. An expectation-conditional maximization algorithm is developed for parameter estimation. Simulated data are used for illustration.
Submitted 12 April, 2017; v1 submitted 3 March, 2017;
originally announced March 2017.
-
Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model
Authors:
Antonio Punzo,
Paul D. McNicholas
Abstract:
The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, it adopts a Gaussian distribution for both the covariates and the responses given the covariates. To robustify the approach with respect to possible elliptical heavy tailed departures from normality, due to the presence of atypical observations, the contaminated Gaussian CWM is here introduced. In addition to the parameters of the Gaussian CWM, each mixture component of our contaminated CWM has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and one specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Furthermore, once the model is estimated and the observations are assigned to the groups, a finer intra-group classification in typical points, outliers, good leverage points, and bad leverage points - concepts of primary importance in robust regression analysis - can be directly obtained. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared to the estimators from the Gaussian CWM. A sensitivity study is also conducted based on a real data set.
Submitted 21 September, 2014;
originally announced September 2014.
-
On nomenclature for, and the relative merits of, two formulations of skew distributions
Authors:
Adelchi Azzalini,
Ryan P. Browne,
Marc G. Genton,
Paul D. McNicholas
Abstract:
We examine some distributions used extensively within the model-based clustering literature in recent years, paying special attention to claims that have been made about their relative efficacy. Theoretical arguments are provided as well as real data examples.
Submitted 3 December, 2015; v1 submitted 21 February, 2014;
originally announced February 2014.
-
A LASSO-Penalized BIC for Mixture Model Selection
Authors:
Sakyajit Bhattacharya,
Paul D. McNicholas
Abstract:
The efficacy of family-based approaches to mixture model-based clustering and classification depends on the selection of parsimonious models. Current wisdom suggests the Bayesian information criterion (BIC) for mixture model selection. However, the BIC has well-known limitations, including a tendency to overestimate the number of components as well as a proclivity for, often drastically, underestimating the number of components in higher dimensions. While the former problem might be soluble through merging components, the latter is impossible to mitigate in clustering and classification applications. In this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this problem. This approach is illustrated based on applications of extensions of mixtures of factor analyzers, where the LPBIC is used to select both the number of components and the number of latent factors. The LPBIC is shown to match or outperform the BIC in several situations.
Submitted 27 November, 2012;
originally announced November 2012.