Semi-supervised Gaussian mixture modelling with a missing-data mechanism in R

Authors: Ziyang Lyu, Daniel Ahfock, Ryan Thompson, Geoffrey J. McLachlan

Abstract: Semi-supervised learning is being extensively applied to estimate classifiers from training data in which not all the labels of the feature vectors are available. We present gmmsslm, an R package for estimating the Bayes' classifier from such partially classified data in the case where the feature vector has a multivariate Gaussian (normal) distribution in each of the predefined classes. Our package implements a recently proposed Gaussian mixture modelling framework that incorporates a missingness mechanism for the missing labels in which the probability of a missing label is represented via a logistic model with covariates that depend on the entropy of the feature vector. Under this framework, it has been shown that the Bayes' classifier formed from the Gaussian mixture model fitted to the partially classified training data can even have a lower error rate than if it were estimated from the completely classified sample. This result was established in the particular case of two Gaussian classes with a common covariance matrix. Here, we focus on the effective implementation of an algorithm for multiple Gaussian classes with arbitrary covariance matrices. A strategy for initialising the algorithm is discussed and illustrated. The new package is demonstrated on some real data.

Submitted 16 April, 2024; v1 submitted 25 February, 2023; originally announced February 2023.

Comments: To appear in the Australian and New Zealand Journal of Statistics

arXiv:2202.02249 [pdf, other]

Functional Mixtures-of-Experts

Authors: Faïcel Chamroukhi, Nhat Thien Pham, Van Hà Hoang, Geoffrey J. McLachlan

Abstract: We consider the statistical analysis of heterogeneous data for prediction in situations where the observations include functions, typically time series. We extend the modeling with Mixtures-of-Experts (ME), as a framework of choice in modeling heterogeneity in data for prediction with vectorial observations, to this functional data analysis context. We first present a new family of ME models, named functional ME (FME), in which the predictors are potentially noisy observations from entire functions. Furthermore, the data-generating process of the predictor and the real response is governed by a hidden discrete variable representing an unknown partition. Second, by imposing sparsity on derivatives of the underlying functional parameters via Lasso-like regularizations, we provide sparse and interpretable functional representations of the FME models called iFME. We develop dedicated expectation-maximization algorithms for Lasso-like (EM-Lasso) regularized maximum-likelihood parameter estimation strategies to fit the models. The proposed models and algorithms are studied in simulated scenarios and in applications to two real data sets, and the obtained results demonstrate their performance in accurately capturing complex nonlinear relationships and in clustering the heterogeneous regression data.

Submitted 20 December, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

MSC Class: 62-XX; 62R10 ACM Class: G.3

arXiv:2104.04046 [pdf, other]

Semi-Supervised Learning of Classifiers from a Statistical Perspective: A Brief Review

Authors: Daniel Ahfock, Geoffrey J. McLachlan

Abstract: There has been increasing attention to semi-supervised learning (SSL) approaches in machine learning to forming a classifier in situations where the training data for a classifier consists of a limited number of classified observations but a much larger number of unclassified observations. This is because the procurement of classified data can be quite costly due to high acquisition costs and subsequent financial, time, and ethical issues that can arise in attempts to provide the true class labels for the unclassified data that have been acquired. We provide here a review of statistical SSL approaches to this problem, focussing on the recent result that a classifier formed from a partially classified sample can actually have smaller expected error rate than that if the sample were completely classified.

Submitted 9 November, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

arXiv:2104.02888 [pdf, other]

Data-fusion using factor analysis and low-rank matrix completion

Authors: Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

Abstract: Data-fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only strict subsets of marginal distributions have been observed. Estimation of the covariance matrix of the full set of variables is challenging given the missing-data pattern. Factor analysis models use lower-dimensional latent variables in the data-generating process, and this introduces low-rank components in the complete-data matrix and the population covariance matrix. The low-rank structure of the factor analysis model can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove the identifiability of the factor analysis model in the statistical file-matching problem under conditions on the number of factors and the number of shared variables over the observed marginal subsets. Additionally, we provide an EM algorithm for parameter estimation. On several real datasets, the factor model gives smaller reconstruction errors in file-matching problems than the common approaches for low-rank matrix completion.

Submitted 6 April, 2021; originally announced April 2021.

arXiv:2104.02872 [pdf, other]

Harmless label noise and informative soft-labels in supervised classification

Authors: Daniel Ahfock, Geoffrey J. McLachlan

Abstract: Manual labelling of training examples is common practice in supervised learning. When the labelling task is of non-trivial difficulty, the supplied labels may not be equal to the ground-truth labels, and label noise is introduced into the training dataset. If the manual annotation is carried out by multiple experts, the same training example can be given different class assignments by different experts, which is indicative of label noise. In the framework of model-based classification, a simple, but key observation is that when the manual labels are sampled using the posterior probabilities of class membership, the noisy labels are as valuable as the ground-truth labels in terms of statistical information. A relaxation of this process is a random effects model for imperfect labelling by a group that uses approximate posterior probabilities of class membership. The relative efficiency of logistic regression using the noisy labels compared to logistic regression using the ground-truth labels can then be derived. The main finding is that logistic regression can be robust to label noise when label noise and classification difficulty are positively correlated. In particular, when classification difficulty is the only source of label errors, multiple sets of noisy labels can supply more information for the estimation of a classification rule compared to the single set of ground-truth labels.

Submitted 6 April, 2021; originally announced April 2021.

arXiv:2009.10622 [pdf, other]

Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts

Authors: TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, Geoffrey J McLachlan

Abstract: We investigate the estimation properties of the mixture of experts (MoE) model in a high-dimensional setting, where the number of predictors is much larger than the sample size, and for which the literature is particularly lacking in theoretical results. We consider the class of softmax-gated Gaussian MoE (SGMoE) models, defined as MoE models with softmax gating functions and Gaussian experts, and focus on the theoretical properties of their $l_1$-regularized estimation via the Lasso. To the best of our knowledge, we are the first to investigate the $l_1$-regularization properties of SGMoE models from a non-asymptotic perspective, under the mildest assumptions, namely the boundedness of the parameter space. We provide a lower bound on the regularization parameter of the Lasso penalty that ensures non-asymptotic theoretical control of the Kullback-Leibler loss of the Lasso estimator for SGMoE models. Finally, we carry out a simulation study to empirically validate our theoretical findings.

Submitted 2 July, 2024; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: Revise and add numerical experiments

MSC Class: 62E17 (Primary) 62H12; 62H30 (Secondary)

arXiv:2005.06848 [pdf, ps, other]

Multi-Node EM Algorithm for Finite Mixture Models

Authors: Sharon X. Lee, Geoffrey J. McLachlan, Kaleb L. Leemaqz

Abstract: Finite mixture models are powerful tools for modelling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation-Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time-consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single-threaded and multi-threaded processors and for both single machine and multiple-node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms, R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.

Submitted 14 May, 2020; originally announced May 2020.

Comments: 12 Pages,1 figure

arXiv:2004.06237 [pdf, other]

Estimation of Classification Rules from Partially Classified Data

Authors: Geoffrey J. McLachlan, Daniel Ahfock

Abstract: We consider the situation where the observed sample contains some observations whose class of origin is known (that is, they are classified with respect to the g underlying classes of interest), and where the remaining observations in the sample are unclassified (that is, their class labels are unknown). For class-conditional distributions taken to be known up to a vector of unknown parameters, the aim is to estimate the Bayes' rule of allocation for the allocation of subsequent unclassified observations. Estimation on the basis of both the classified and unclassified data can be undertaken in a straightforward manner by fitting a g-component mixture model by maximum likelihood (ML) via the EM algorithm in the situation where the observed data can be assumed to be an observed random sample from the adopted mixture distribution. This assumption applies if the missing-data mechanism is ignorable in the terminology pioneered by Rubin (1976). An initial likelihood approach was to use the so-called classification ML approach whereby the missing labels are taken to be parameters to be estimated along with the parameters of the class-conditional distributions. However, as it can lead to inconsistent estimates, the focus of attention switched to the mixture ML approach after the appearance of the EM algorithm (Dempster et al., 1977). Particular attention is given here to the asymptotic relative efficiency (ARE) of the Bayes' rule estimated from a partially classified sample. Lastly, we consider briefly some recent results in situations where the missing label pattern is non-ignorable for the purposes of ML estimation for the mixture model.

Submitted 13 April, 2020; originally announced April 2020.

Comments: Based on invited talk given to the 16th Conference of the International Federation of Classification Societies in Thessaloniki, August 2019

arXiv:1910.09189 [pdf, other]

An Apparent Paradox: A Classifier Trained from a Partially Classified Sample May Have Smaller Expected Error Rate Than That If the Sample Were Completely Classified

Authors: Daniel Ahfock, Geoffrey J. McLachlan

Abstract: There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has known class label. Hence assuming that the labels of the unclassified features are randomly missing or their missing-label mechanism is simply ignored, the expected error rate of a classifier formed from a partially classified sample is greater than that if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness in situations where these labels are not randomly missing. An examination of several partially classified data sets in the literature suggests that the unclassified features are not occurring at random but rather tend to be concentrated in regions of relatively high entropy in the feature space. Here in the context of two normal classes with a common covariance matrix we consider the situation where the missingness of the labels of the unclassified features can be modelled by a logistic model in which the probability of a missing label for a feature depends on its entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have smaller expected error rate than that if the sample were completely classified.

Submitted 6 November, 2019; v1 submitted 21 October, 2019; originally announced October 2019.
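
As a rough illustration of the missingness mechanism described in the abstract above, the following Python sketch computes the entropy of the posterior class probabilities for a feature and passes it through a logistic model. It is a simplified sketch and not the authors' implementation: it assumes univariate features (the one-dimensional case of a common covariance matrix), and the function names and the coefficients `xi0` and `xi1` are hypothetical. The exact functional form used in the paper may differ.

```python
import math

def posterior_two_class(y, mu1, mu2, sigma, pi1=0.5):
    """Posterior class probabilities for two normal classes with a common variance."""
    # The shared normalising constant of the two densities cancels in the ratio.
    a = math.log(pi1) - 0.5 * ((y - mu1) / sigma) ** 2
    b = math.log(1.0 - pi1) - 0.5 * ((y - mu2) / sigma) ** 2
    m = max(a, b)  # subtract the maximum for numerical stability
    w1, w2 = math.exp(a - m), math.exp(b - m)
    return w1 / (w1 + w2), w2 / (w1 + w2)

def entropy(probs):
    """Shannon entropy of a vector of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def missing_label_probability(y, mu1, mu2, sigma, xi0, xi1, pi1=0.5):
    """Assumed logistic model: P(label missing | y) depends on y only via its entropy."""
    e = entropy(posterior_two_class(y, mu1, mu2, sigma, pi1))
    return 1.0 / (1.0 + math.exp(-(xi0 + xi1 * e)))
```

With a positive `xi1`, features near the decision boundary (high entropy) are more likely to be left unlabelled than features far from it, which is the pattern the abstract reports in real partially classified data sets.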

arXiv:1904.12057 [pdf, ps, other]

Comment on "Hidden truncation hyperbolic distributions, finite mixtures thereof and their application for clustering" Murray, Browne, and McNicholas

Authors: Geoffrey J. McLachlan, Sharon X. Lee

Abstract: We comment on the paper of Murray, Browne, and McNicholas (2017), who proposed mixtures of skew distributions, which they termed hidden truncation hyperbolic (HTH). They recently made a clarification (Murray, Browne, McNicholas, 2019) concerning their claim that the so-called CFUST distribution is a special case of the HTH distribution. There are also some other matters in the original version of the paper that were in need of clarification as discussed here.

Submitted 26 April, 2019; originally announced April 2019.

Comments: 7 pages

arXiv:1904.02883 [pdf, other]

On missing label patterns in semi-supervised learning

Authors: Daniel Ahfock, Geoffrey J. McLachlan

Abstract: We investigate model based classification with partially labelled training data. In many biostatistical applications, labels are manually assigned by experts, who may leave some observations unlabelled due to class uncertainty. We analyse semi-supervised learning as a missing data problem and identify situations where the missing label pattern is non-ignorable for the purposes of maximum likelihood estimation. In particular, we find that a relationship between classification difficulty and the missing label pattern implies a non-ignorable missingness mechanism. We examine a number of real datasets and conclude the pattern of missing labels is related to the difficulty of classification. We propose a joint modelling strategy involving the observed data and the missing label mechanism to account for the systematic missing labels. Full likelihood inference including the missing label mechanism can improve the efficiency of parameter estimation, and increase classification accuracy.

Submitted 5 April, 2019; originally announced April 2019.

arXiv:1903.12342 [pdf, other]

Statistical matching of non-Gaussian data

Authors: Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

Abstract: The statistical matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets that only have a strict subset of variables jointly observed across all datasets. The simplest version involves two datasets, labelled A and B, with three variables of interest $X, Y$ and $Z$. Variables $X$ and $Y$ are observed in dataset A and variables $X$ and $Z$ are observed in dataset B. Statistical inference is complicated by the absence of joint $(Y, Z)$ observations. Parametric modelling can be challenging due to identifiability issues and the difficulty of parameter estimation. We develop computationally feasible procedures for the statistical matching of non-Gaussian data using suitable data augmentation schemes and identifiability constraints. Nearest-neighbour imputation is a common alternative technique due to its ease of use and generality. Nearest-neighbour matching is based on a conditional independence assumption that may be inappropriate for non-Gaussian data. The violation of the conditional independence assumption can lead to improper imputations. We compare model based approaches to nearest-neighbour imputation on a number of flow cytometry datasets and find that the model based approach can address some of the weaknesses of the nonparametric nearest-neighbour technique.

Submitted 28 March, 2019; originally announced March 2019.

arXiv:1902.03335 [pdf, other]

Mini-batch learning of exponential family finite mixture models

Authors: H D Nguyen, F Forbes, G J McLachlan

Abstract: Mini-batch algorithms have become increasingly popular due to the requirement for solving optimization problems, based on large-scale data sets. Using an existing online expectation-maximization (EM) algorithm framework, we demonstrate how mini-batch (MB) algorithms may be constructed, and propose a scheme for the stochastic stabilization of the constructed mini-batch algorithms. Theoretical results regarding the convergence of the mini-batch EM algorithms are presented. We then demonstrate how the mini-batch framework may be applied to conduct maximum likelihood (ML) estimation of mixtures of exponential family distributions, with emphasis on ML estimation for mixtures of normal distributions. Via a simulation study, we demonstrate that the mini-batch algorithm for mixtures of normal distributions can outperform the standard EM algorithm. Further evidence of the performance of the mini-batch framework is provided via an application to the famous MNIST data set.

Submitted 5 September, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

arXiv:1810.04842 [pdf, ps, other]

On formulations of skew factor models: skew errors versus skew factors

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: In the past few years, there have been a number of proposals for generalizing the factor analysis (FA) model and its mixture version (known as mixtures of factor analyzers (MFA)) using non-normal and asymmetric distributions. These models adopt various types of skew densities for either the factors or the errors. While the relationships between various choices of skew distributions have been discussed in the literature, the differences between placing the assumption of skewness on the factors or on the errors have not been closely studied. This paper examines these formulations and discusses the connections between these two types of formulations for skew factor models. In doing so, we introduce a further formulation that unifies these two formulations; that is, placing a skew distribution on both the factors and the errors.

Submitted 20 November, 2018; v1 submitted 11 October, 2018; originally announced October 2018.

arXiv:1805.04394 [pdf, other]

False discovery rate control under reduced precision computation for analysis of neuroimaging data

Authors: Hien D. Nguyen, Yohan Yee, Geoffrey J. McLachlan, Jason P. Lerch

Abstract: The mitigation of false positives is an important issue when conducting multiple hypothesis testing. The most popular paradigm for false positives mitigation in high-dimensional applications is via the control of the false discovery rate (FDR). Multiple testing data from neuroimaging experiments can be very large, and reduced precision storage of such data is often required. Reduced precision computation is often a problem in the analysis of legacy data and data arising from legacy pipelines. We present a method for FDR control that is applicable in cases where only p-values or test statistics (with common and known null distribution) are available, and when those p-values or test statistics are encoded in a reduced precision format. Our method is based on an empirical-Bayes paradigm where the probit transformations of the p-values (called the z-scores) are modeled as a two-component mixture of normal distributions. Due to the reduced precision of the p-values or test statistics, the usual approach for fitting mixture models may not be feasible. We instead use a binned-data technique, which can be proved to consistently estimate the z-score distribution parameters under mild correlation assumptions, as is often the case in neuroimaging data. A simulation study shows that our methodology is competitive when compared with popular alternatives, especially with data in the presence of misspecification. We demonstrate the applicability of our methodology in practice via a brain imaging study of mice.

Submitted 16 July, 2018; v1 submitted 11 May, 2018; originally announced May 2018.
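
The probit step and the two-component normal mixture mentioned in the abstract above can be sketched in Python as follows. This is only an illustrative sketch using the standard library: the function names and all mixture parameters are hypothetical, the null-membership posterior is a generic empirical-Bayes quantity rather than something taken from the paper, and the binned-data fitting procedure the authors describe is not shown.

```python
from statistics import NormalDist

def p_to_z(p_values):
    """Probit transform: map p-values in (0, 1) to z-scores."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # inv_cdf raises StatisticsError for p <= 0 or p >= 1.
    return [std_normal.inv_cdf(p) for p in p_values]

def mixture_density(z, pi0, mu0, sd0, mu1, sd1):
    """Density of a two-component normal mixture (null and non-null components)."""
    null_part = pi0 * NormalDist(mu0, sd0).pdf(z)
    alt_part = (1.0 - pi0) * NormalDist(mu1, sd1).pdf(z)
    return null_part + alt_part

def null_posterior(z, pi0, mu0, sd0, mu1, sd1):
    """Posterior probability that a z-score belongs to the null component."""
    null_part = pi0 * NormalDist(mu0, sd0).pdf(z)
    return null_part / mixture_density(z, pi0, mu0, sd0, mu1, sd1)
```

In practice the parameters `pi0`, `mu0`, `sd0`, `mu1` and `sd1` would be estimated from the data; here they are simply passed in so that the structure of the model is visible.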

arXiv:1804.08365 [pdf, other]

Positive data kernel density estimation via the logKDE package for R

Authors: Andrew T. Jones, Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: Kernel density estimators (KDEs) are ubiquitous tools for nonparametric estimation of probability density functions (PDFs), when data are obtained from unknown data generating processes. The KDEs that are typically available in software packages are defined, and designed, to estimate real-valued data. When applied to positive data, these typical KDEs do not yield bona fide PDFs. A log-transformation methodology can be applied to produce a nonparametric estimator that is appropriate and yields proper PDFs over positive supports. We call the KDEs obtained via this transformation log-KDEs. We derive expressions for the pointwise biases, variances, and mean-squared errors of the log-KDEs that are obtained via various kernel functions. Mean integrated squared error (MISE) and asymptotic MISE results are also provided and a plug-in rule for log-KDE bandwidths is derived. We demonstrate the log-KDE methodology via our R package, logKDE. Real data case studies are provided to demonstrate the log-KDE approach.

Submitted 5 August, 2018; v1 submitted 23 April, 2018; originally announced April 2018.
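
The log-transformation idea in the abstract above can be sketched in a few lines of Python: estimate the density of the log-data with an ordinary KDE, then map back to the positive half-line with the change-of-variables factor 1/x. This is a minimal sketch assuming a Gaussian kernel and a user-supplied bandwidth. The function name `log_kde` is hypothetical and does not reflect the interface of the logKDE R package.

```python
import math

def log_kde(x, data, bandwidth):
    """Log-transform KDE at x for strictly positive data, using a Gaussian kernel."""
    if x <= 0.0:
        return 0.0  # the estimator places no mass outside the positive support
    log_x = math.log(x)
    total = 0.0
    for xi in data:
        u = (log_x - math.log(xi)) / bandwidth
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    # The factor 1/x is the Jacobian of the transformation y = log(x).
    return total / (len(data) * bandwidth * x)
```

Because the kernel is applied on the log scale, the resulting estimate is zero for non-positive arguments and integrates to one over the positive reals, which is the property that a standard KDE applied directly to positive data lacks.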

arXiv:1804.08341 [pdf, other]

Randomized Mixture Models for Probability Density Approximation and Estimation

Authors: Hien D. Nguyen, Dianhui Wang, Geoffrey J. McLachlan

Abstract: Randomized neural networks (NNs) are an interesting alternative to conventional NNs that are more commonly used for data modeling. The random vector functional-link (RVFL) network is an established and theoretically well-grounded randomized learning model. A key theoretical result for RVFL networks is that they provide universal approximation for continuous maps, on average, almost surely. We specialize and modify this result, and show that RVFL networks can provide functional approximations that converge in Kullback-Leibler divergence, when the target function is a probability density function. Expanding on the approximation results, we demonstrate that the RVFL networks lead to a simple randomized mixture model (MM) construction for density estimation from random data. An expectation-maximization (EM) algorithm is derived for the maximum likelihood estimation of our randomized MM. The EM algorithm is proved to be globally convergent and the maximum likelihood estimator is proved to be consistent. A set of simulation studies is given to provide empirical evidence towards our approximation and density estimation results.

Submitted 23 April, 2018; originally announced April 2018.

arXiv:1802.02467 [pdf, other]

Mixtures of Factor Analyzers with Fundamental Skew Symmetric Distributions

Authors: Sharon X. Lee, Tsung-I Lin, Geoffrey J. McLachlan

Abstract: Mixtures of factor analyzers (MFA) provide a powerful tool for modelling high-dimensional datasets. In recent years, several generalizations of MFA have been developed where the normality assumption of the factors and/or of the errors was relaxed to allow for skewness in the data. However, due to the form of the adopted component densities, the distribution of the factors/errors in most of these models is typically limited to modelling skewness concentrated in a single direction. Here, we introduce a more flexible finite mixture of factor analyzers based on the class of scale mixtures of canonical fundamental skew normal (SMCFUSN) distributions. This very general class of skew distributions can capture various types of skewness and asymmetry in the data. In particular, the proposed mixture model of SMCFUSN factor analyzers (SMCFUSNFA) can simultaneously accommodate multiple directions of skewness. As such, it encapsulates many commonly used models as special and/or limiting cases, such as models of some versions of skew normal and skew t-factor analyzers, and skew hyperbolic factor analyzers. For illustration, we focus on the t-distribution member of the class of SMCFUSN distributions, leading to mixtures of canonical fundamental skew t-factor analyzers (CFUSTFA). Parameter estimation can be carried out by maximum likelihood via an EM-type algorithm. The usefulness and potential of the proposed model are demonstrated using two real datasets.

Submitted 26 October, 2018; v1 submitted 7 February, 2018; originally announced February 2018.

arXiv:1711.06929 [pdf, ps, other]

Deep Gaussian Mixture Models

Authors: Cinzia Viroli, Geoffrey J. McLachlan

Abstract: Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers.

Submitted 18 November, 2017; originally announced November 2017.

Comments: 19 pages, 4 figures

arXiv:1705.04651 [pdf, other]

Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach

Authors: Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: Support vector machines (SVMs) are an important tool in modern data analysis. Traditionally, support vector machines have been fitted via quadratic programming, either using purpose-built or off-the-shelf algorithms. We present an alternative approach to SVM fitting via the majorization--minimization (MM) paradigm. Algorithms that are derived via MM algorithm constructions can be shown to monotonically decrease their objectives at each iteration, as well as be globally convergent to stationary points. We demonstrate the construction of iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm, for SVM risk minimization problems involving the hinge, least-square, squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net penalizations. Successful implementations of our algorithms are presented via some numerical examples.

Submitted 12 May, 2017; originally announced May 2017.

arXiv:1701.04512 [pdf, other]

Some Theoretical Results Regarding the Polygonal Distribution

Authors: Hien D Nguyen, Geoffrey J McLachlan

Abstract: The polygonal distributions are a class of distributions that can be defined via the mixture of triangular distributions over the unit interval. The class includes the uniform and trapezoidal distributions, and is an alternative to the beta distribution. We demonstrate that the polygonal densities are dense in the class of continuous and concave densities with bounded second derivatives. Pointwise consistency and Hellinger consistency results for the maximum likelihood (ML) estimator are obtained. A useful model selection theorem is stated as well as results for a related distribution that is obtained via the pointwise square of polygonal density functions.

Submitted 16 January, 2017; originally announced January 2017.

arXiv:1612.06492 [pdf, other]

Chunked-and-Averaged Estimators for Vector Parameters

Authors: Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: A divide-and-conquer method for parameter estimation is the chunked-and-averaged (CA) estimator. CA estimators have been studied for univariate parameters under independent and identically distributed (IID) sampling. We study the CA estimators of vector parameters and under non-IID sampling.

Submitted 29 August, 2017; v1 submitted 19 December, 2016; originally announced December 2016.
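
The chunked-and-averaged idea in the abstract above can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the authors' code: the function names `chunked_average` and `mean_vector` are hypothetical, the sample mean stands in for a generic vector-valued estimator, and any remainder observations are simply folded into the last chunk.

```python
def mean_vector(rows):
    """Example vector estimator (an assumption): the componentwise sample mean."""
    d = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(d)]


def chunked_average(rows, estimator, n_chunks):
    """Split the data into n_chunks consecutive chunks, apply the estimator
    to each chunk, and average the chunk estimates componentwise."""
    size = len(rows) // n_chunks
    if size == 0:
        raise ValueError("need at least one observation per chunk")
    estimates = []
    for c in range(n_chunks):
        start = c * size
        # The last chunk absorbs any leftover observations.
        stop = (c + 1) * size if c < n_chunks - 1 else len(rows)
        estimates.append(estimator(rows[start:stop]))
    d = len(estimates[0])
    return [sum(e[j] for e in estimates) / n_chunks for j in range(d)]


rows = [[float(i), float(2 * i)] for i in range(12)]
ca_estimate = chunked_average(rows, mean_vector, n_chunks=3)
```

With equal-sized chunks and a linear estimator such as the mean, the averaged estimate coincides with the full-data estimate; for nonlinear estimators the two generally differ, which is the case such results are concerned with.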

arXiv:1611.03974 [pdf, other]

On approximations via convolution-defined mixture models

Authors: Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: An often-cited fact regarding mixing or mixture distributions is that their density functions are able to approximate the density function of any unknown distribution to arbitrary degrees of accuracy, provided that the mixing or mixture distribution is sufficiently complex. This fact is often not made concrete. We investigate and review theorems that provide approximation bounds for mixing distributions. Connections between the approximation bounds of mixing distributions and estimation bounds for the maximum likelihood estimator of finite mixtures of location-scale distributions are reviewed.

Submitted 1 March, 2018; v1 submitted 12 November, 2016; originally announced November 2016.

arXiv:1611.01602 [pdf, other]

Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling

Authors: Hien D. Nguyen, Jeremy F. P. Ullmann, Geoffrey J. McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M. C. Hillman, David C. Reutens, Andrew L. Janke

Abstract: Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology via Gaussian mixtures is proposed for the clustering of data from such visualizations. The methodology is theoretically justified and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.

Submitted 28 February, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

arXiv:1608.05481 [pdf, other]

Faster Functional Clustering via Gaussian Mixture Models

Authors: Hien D Nguyen, Geoffrey J McLachlan, Jeremy F P Ullmann, Andrew L Janke

Abstract: Functional data analysis (FDA) is an important modern paradigm for handling infinite-dimensional data. An important task in FDA is model-based clustering, which organizes functional populations into groups via subpopulation structures. The most common approach for model-based clustering of functional data is via mixtures of linear mixed-effects models. The mixture of linear mixed-effects models (MLMM) approach requires a computationally intensive algorithm for estimation. We provide a novel Gaussian mixture model (GMM) characterization of the model-based clustering problem. We demonstrate that this GMM-based characterization allows for improved computational speeds over the MLMM approach when applied via available functions in the R programming environment. Theoretical considerations for the GMM approach are discussed. An example application to a dataset based upon calcium imaging in the larval zebrafish brain is provided as a demonstration of the effectiveness of the simpler GMM approach.

Submitted 12 February, 2017; v1 submitted 18 August, 2016; originally announced August 2016.

arXiv:1608.02797 [pdf, other]

A block EM algorithm for multivariate skew normal and skew t-mixture models

Authors: Sharon X Lee, Kaleb L Leemaqz, Geoffrey J McLachlan

Abstract: Finite mixtures of skew distributions provide a flexible tool for modelling heterogeneous data with asymmetric distributional features. However, parameter estimation via the Expectation-Maximization (EM) algorithm can become very time-consuming due to the complicated expressions involved in the E-step that are numerically expensive to evaluate. A more time-efficient implementation of the EM algorithm was recently proposed which allows each component of the mixture model to be evaluated in parallel. In this paper, we develop a block implementation of the EM algorithm that facilitates the calculations in the E- and M-steps to be spread across a larger number of threads. We focus on the fitting of finite mixtures of multivariate skew normal and skew t-distributions, and show that both the E- and M-steps in the EM algorithm can be modified to allow the data to be split into blocks. The approach can be easily implemented for use by multicore and multi-processor machines. It can also be applied concurrently with the recently proposed multithreaded EM algorithm to achieve further reduction in computation time. The improvement in time performance is illustrated on some real datasets.

Submitted 9 August, 2016; originally announced August 2016.
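
To make the block idea in the abstract above concrete, here is a minimal sketch of one EM iteration in which the E-step is computed block by block and the block-level sufficient statistics are summed before the M-step. It is an illustration only and rests on simplifying assumptions: a univariate normal mixture replaces the multivariate skew normal and skew t components of the paper, the blocks are processed in a sequential loop rather than on separate threads, and the names `block_stats` and `em_step` are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def block_stats(block, weights, means, sds):
    """E-step on a single block: per-component sums of r, r*x and r*x^2,
    where r is the posterior probability of component membership."""
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for x in block:
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)
        for j in range(k):
            r = dens[j] / total
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return s0, s1, s2


def em_step(data, weights, means, sds, n_blocks=1):
    """One EM iteration with the data split into n_blocks blocks.
    Each block could be handed to its own worker; here they run in turn."""
    size = math.ceil(len(data) / n_blocks)
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for b0, b1, b2 in (block_stats(b, weights, means, sds) for b in blocks):
        for j in range(k):
            s0[j] += b0[j]
            s1[j] += b1[j]
            s2[j] += b2[j]
    # M-step from the pooled sufficient statistics.
    n = len(data)
    new_weights = [s0[j] / n for j in range(k)]
    new_means = [s1[j] / s0[j] for j in range(k)]
    new_sds = [math.sqrt(s2[j] / s0[j] - new_means[j] ** 2) for j in range(k)]
    return new_weights, new_means, new_sds


data = [-2.1, -1.9, -2.4, -1.6, 2.0, 2.3, 1.7, 2.2, 1.8, -2.0]
update = em_step(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0], n_blocks=3)
```

Because the sufficient statistics are additive over observations, the blocked update agrees with the unblocked one up to floating-point rounding, which is what allows the work to be spread across workers.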

arXiv:1607.04807 [pdf, other]

Progress on a Conjecture Regarding the Triangular Distribution

Authors: Hien D Nguyen, Geoffrey J McLachlan

Abstract: Triangular distributions are a well-known class of distributions that are often used as an elementary example of a probability model. Maximum likelihood estimation of the mode parameter of the triangular distribution over the unit interval can be performed via an order statistics-based method. It had been conjectured that such a method can be conducted using only a constant number of likelihood function evaluations, on average, as the sample size becomes large. We prove two theorems that validate this conjecture. Graphical and numerical results are presented to supplement our proofs.

Submitted 5 November, 2016; v1 submitted 16 July, 2016; originally announced July 2016.

arXiv:1606.02054 [pdf, other]

A simple multithreaded implementation of the EM algorithm for mixture models

Authors: Sharon X Lee, Kaleb L Leemaqz, Geoffrey J McLachlan

Abstract: Finite mixture models have been widely used for the modelling and analysis of data from heterogeneous populations. Maximum likelihood estimation of the parameters is typically carried out via the Expectation-Maximization (EM) algorithm. The complexity of the implementation of the algorithm depends on the parametric distribution that is adopted as the component densities of the mixture model. In the case of the skew normal and skew t-distributions, for example, the E-step would involve complicated expressions that are computationally expensive to evaluate. This can become quite time-consuming for large and/or high-dimensional datasets. In this paper, we develop a multithreaded version of the EM algorithm for the fitting of finite mixture models. Due to the structure of the algorithm for these models, the E- and M-steps can be easily reformulated to be executed in parallel across multiple threads to take advantage of the processing power available in modern-day multicore machines. Our approach is simple and easy to implement, requiring only small changes to standard code. To illustrate the approach, we focus on a fairly general mixture model that includes as special or limiting cases some of the most commonly used mixture models including the normal, t-, skew normal, and skew t-mixture models.

Submitted 7 June, 2016; originally announced June 2016.

arXiv:1603.08326 [pdf, other]

A globally convergent algorithm for lasso-penalized mixture of linear regression models

Authors: Luke R. Lloyd-Jones, Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: Variable selection is an old and pervasive problem in regression analysis. One solution is to impose a lasso penalty to shrink parameter estimates toward zero and perform continuous model selection. The lasso-penalized mixture of linear regressions model (L-MLR) is a class of regularization methods for the model selection problem in the fixed number of variables setting. In this article, we propose a new algorithm for the maximum penalized-likelihood estimation of the L-MLR model. This algorithm is constructed via the minorization--maximization algorithm paradigm. Such a construction allows for coordinate-wise updates of the parameter components, and produces globally convergent sequences of estimates that generate monotonic sequences of penalized log-likelihood values. These three features are missing in the previously presented approximate expectation-maximization algorithms. The previous difficulty in producing a globally convergent algorithm for the maximum penalized-likelihood estimation of the L-MLR model is due to the intractability of finding exact updates for the mixture model mixing proportions in the maximization-step. In our algorithm, we solve this issue by showing that it can be converted into a polynomial root finding problem. Our solution to this problem involves a polynomial basis conversion that is interesting in its own right. The method is tested in simulation and with an application to Major League Baseball salary data from the 1990s and the present day. We explore the concept of whether player salaries are associated with batting performance.

Submitted 2 May, 2016; v1 submitted 28 March, 2016; originally announced March 2016.

Comments: 38 pages, 4 tables, 2 figures

arXiv:1603.04613 [pdf, other]

doi 10.1109/LSP.2016.2586180

A Block Minorization--Maximization Algorithm for Heteroscedastic Regression

Authors: Hien D. Nguyen, Luke R. Lloyd-Jones, Geoffrey J. McLachlan

Abstract: The computation of the maximum likelihood (ML) estimator for heteroscedastic regression models is considered. The traditional Newton algorithms for the problem require matrix multiplications and inversions, which are bottlenecks in modern Big Data contexts. A new Big Data-appropriate minorization--maximization (MM) algorithm is considered for the computation of the ML estimator. The MM algorithm is proved to generate monotonically increasing sequences of likelihood values and to be convergent to a stationary point of the log-likelihood function. A distributed and parallel implementation of the MM algorithm is presented and the MM algorithm is shown to have differing time complexity to the Newton algorithm. Simulation studies demonstrate that the MM algorithm improves upon the computation time of the Newton algorithm in some practical scenarios where the number of observations is large.

Submitted 30 May, 2016; v1 submitted 15 March, 2016; originally announced March 2016.

arXiv:1602.08787 [pdf, other]

Maximum Pseudolikelihood Estimation for Model-Based Clustering of Time Series Data

Authors: Hien D Nguyen, Geoffrey J McLachlan, Pierre Orban, Pierre Bellec, Andrew L Janke

Abstract: Mixture of autoregressions (MoAR) models provide a model-based approach to the clustering of time series data. The maximum likelihood (ML) estimation of MoAR models requires the evaluation of products of large numbers of densities of normal random variables. In practical scenarios, these products converge to zero as the length of the time series increases, and thus the ML estimation of MoAR models becomes infeasible without the use of numerical tricks. We propose a maximum pseudolikelihood (MPL) estimation approach as an alternative to the use of numerical tricks. The MPL estimator is proved to be consistent and can be computed via an EM (expectation--maximization) algorithm. Simulations are used to assess the performance of the MPL estimator against that of the ML estimator in cases where the latter was able to be calculated. An application to the clustering of time series data arising from a resting-state fMRI experiment is presented as a demonstration of the methodology.

Submitted 17 October, 2016; v1 submitted 28 February, 2016; originally announced February 2016.
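
The underflow problem mentioned in the abstract above is easy to reproduce. The sketch below compares a naive product of normal densities with a log-domain evaluation using the log-sum-exp device, which is one common numerical trick of the kind the abstract alludes to; it is not the MPL estimator proposed in the paper. The simple zero-intercept AR(1) components, the synthetic series, and all function names are assumptions made for illustration.

```python
import math


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def ar1_logdensity(series, phi, sd):
    """Log-density of the series given its first value under a zero-intercept
    AR(1) model: x_t | x_{t-1} ~ N(phi * x_{t-1}, sd^2)."""
    return sum(normal_logpdf(series[t], phi * series[t - 1], sd)
               for t in range(1, len(series)))


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mixture_loglik(series, weights, phis, sds):
    """Mixture log-likelihood of one series, evaluated entirely in the log domain."""
    terms = [math.log(w) + ar1_logdensity(series, phi, sd)
             for w, phi, sd in zip(weights, phis, sds)]
    return logsumexp(terms)


def naive_mixture_lik(series, weights, phis, sds):
    """The same likelihood computed as a raw product of densities."""
    total = 0.0
    for w, phi, sd in zip(weights, phis, sds):
        prod = 1.0
        for t in range(1, len(series)):
            prod *= math.exp(normal_logpdf(series[t], phi * series[t - 1], sd))
        total += w * prod
    return total


series = [math.sin(t) for t in range(2000)]  # synthetic series, for illustration
naive = naive_mixture_lik(series, [0.5, 0.5], [0.3, -0.3], [1.0, 1.0])
stable = mixture_loglik(series, [0.5, 0.5], [0.3, -0.3], [1.0, 1.0])
```

With unit standard deviations each density factor is below 0.4, so the raw product of roughly two thousand factors underflows to zero in double precision, while the log-domain value remains finite.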

arXiv:1602.03697 [pdf, other]

Linear Mixed Models with Marginally Symmetric Nonparametric Random Effects

Authors: Hien D. Nguyen, Geoffrey J. McLachlan

Abstract: Linear mixed models (LMMs) are used as an important tool in the data analysis of repeated measures and longitudinal studies. The most common form of LMMs utilizes a normal distribution to model the random effects. Such assumptions can often lead to misspecification errors when the random effects are not normal. One approach to remedy the misspecification errors is to utilize a point-mass distribution to model the random effects; this is known as the nonparametric maximum likelihood-fitted (NPML) model. The NPML model is flexible but requires a large number of parameters to characterize the random-effects distribution. It is often natural to assume that the random-effects distribution be at least marginally symmetric. The marginally symmetric NPML (MSNPML) random-effects model is introduced, which assumes a marginally symmetric point-mass distribution for the random effects. Under the symmetry assumption, the MSNPML model utilizes half the number of parameters to characterize the same number of point masses as the NPML model; thus the model confers an advantage in economy and parsimony. An EM-type algorithm is presented for the maximum likelihood (ML) estimation of LMMs with MSNPML random effects; the algorithm is shown to monotonically increase the log-likelihood and is proven to be convergent to a stationary point of the log-likelihood function in the case of convergence. Furthermore, it is shown that the ML estimator is consistent and asymptotically normal under certain conditions, and the estimation of quantities such as the random-effects covariance matrix and individual a posteriori expectations is demonstrated.

Submitted 13 February, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

arXiv:1602.03692 [pdf, other]

Maximum Likelihood Estimation of Triangular and Polygonal Distributions

Authors: Hien D Nguyen, Geoffrey J McLachlan

Abstract: Triangular distributions are a well-known class of distributions that are often used as an elementary example of a probability model. In the past, enumeration and order statistic-based methods have been suggested for the maximum likelihood (ML) estimation of such distributions. A novel parametrization of triangular distributions is presented. The parametrization allows for the construction of an MM (minorization--maximization) algorithm for the ML estimation of triangular distributions. The algorithm is shown to both monotonically increase the likelihood evaluations, and be globally convergent. The parametrization is then applied to construct an MM algorithm for the ML estimation of polygonal distributions. This algorithm is shown to have the same numerical properties as that of the triangular distribution. Numerical simulations are provided to demonstrate the performances of the new algorithms against established enumeration and order statistics-based methods.

Submitted 13 February, 2016; v1 submitted 11 February, 2016; originally announced February 2016.
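
For context, the enumeration baseline mentioned in the abstract above can be sketched in a few lines. The sketch is not the MM algorithm of the paper. It assumes the triangular density on the unit interval with mode m, namely 2x/m for x <= m and 2(1 - x)/(1 - m) for x > m, and relies on the known property that the likelihood is maximized at one of the observed values, so it suffices to evaluate the log-likelihood at each sample point. The function names are hypothetical and the observations are assumed to lie strictly inside (0, 1).

```python
import math


def triangular_loglik(mode, sample):
    """Log-likelihood of the unit-interval triangular distribution with the given mode."""
    total = 0.0
    for x in sample:
        if x <= mode:
            total += math.log(2.0 * x / mode)
        else:
            total += math.log(2.0 * (1.0 - x) / (1.0 - mode))
    return total


def mode_mle_by_enumeration(sample):
    """Evaluate the log-likelihood at every observation and keep the best candidate."""
    return max(sample, key=lambda candidate: triangular_loglik(candidate, sample))


sample = [0.12, 0.35, 0.41, 0.48, 0.55, 0.62, 0.90]
mode_hat = mode_mle_by_enumeration(sample)
```

This costs one full likelihood evaluation per observation, so the work grows quadratically with the sample size.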

arXiv:1602.03683 [pdf, other]

A Universal Approximation Theorem for Mixture of Experts Models

Authors: Hien D Nguyen, Luke R Lloyd-Jones, Geoffrey J McLachlan

Abstract: The mixture of experts (MoE) model is a popular neural network architecture for nonlinear regression and classification. The class of MoE mean functions is known to be uniformly convergent to any unknown target function, assuming that the target function is from a Sobolev space that is sufficiently differentiable and that the domain of estimation is a compact unit hypercube. We provide an alternative result, which shows that the class of MoE mean functions is dense in the class of all continuous functions over arbitrary compact domains of estimation. Our result can be viewed as a universal approximation theorem for MoE models.

Submitted 11 February, 2016; originally announced February 2016.

arXiv:1601.03517 [pdf, other]

Spatial Clustering of Time-Series via Mixture of Autoregressions Models and Markov Random Fields

Authors: Hien D Nguyen, Geoffrey J McLachlan, Jeremy F P Ullmann, Andrew L Janke

Abstract: Time-series data arise in many medical and biological imaging scenarios. In such images, a time-series is obtained at each of a large number of spatially-dependent data units. It is interesting to organize these data into model-based clusters. A two-stage procedure is proposed. In Stage 1, a mixture of autoregressions (MoAR) model is used to marginally cluster the data. The MoAR model is fitted using maximum marginal likelihood (MMaL) estimation via an MM (minorization--maximization) algorithm. In Stage 2, a Markov random field (MRF) model induces a spatial structure onto the Stage 1 clustering. The MRF model is fitted using maximum pseudolikelihood (MPL) estimation via an MM algorithm. Both the MMaL and MPL estimators are proved to be consistent. Numerical properties are established for both MM algorithms. A simulation study demonstrates the performance of the two-stage procedure. An application to the segmentation of a zebrafish brain calcium image is presented.

Submitted 14 January, 2016; originally announced January 2016.

arXiv:1601.00773 [pdf, other]

Comment on "On Nomenclature, and the Relative Merits of Two Formulations of Skew Distributions" by A. Azzalini, R. Browne, M. Genton, and P. McNicholas

Authors: Geoffrey J. McLachlan, Sharon X. Lee

Abstract: We comment on the recent paper by Azzalini et al. (2015) on two different distributions proposed in the literature for the modelling of data that have asymmetric and possibly long-tailed clusters. They are referred to as the restricted and unrestricted skew normal and skew t-distributions by Lee and McLachlan (2013a). We clarify an apparent misunderstanding in Azzalini et al. (2015) of this nomenclature to distinguish between these two models. Also, we note that McLachlan and Lee (2014) have obtained improved results for the unrestricted model over those reported in Azzalini et al. (2015) for the two datasets that were analysed by them to form the basis of their claims on the relative superiority of the restricted and unrestricted models. On this matter of the relative superiority of these two models, Lee and McLachlan (2014b, 2016) have shown how a distribution belonging to the broader class, the canonical fundamental skew t (CFUST) class, can be fitted with little additional computational effort than for the unrestricted distribution. The CFUST class includes the restricted and unrestricted distributions as special cases. Thus the user now has the option of letting the data decide as to which model is appropriate for their particular dataset.

Submitted 5 January, 2016; originally announced January 2016.

arXiv:1509.02069 [pdf, other]

EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: This paper presents an R package EMMIXcskew for the fitting of the canonical fundamental skew t-distribution (CFUST) and finite mixtures of this distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution provides a flexible family of models to handle non-normal data, with parameters for capturing skewness and heavy-tails in the data. It formally encompasses the normal, t, and skew-normal distributions as special and/or limiting cases. A few other versions of the skew t-distributions are also nested within the CFUST distribution. In this paper, an Expectation-Maximization (EM) algorithm is described for computing the ML estimates of the parameters of the FM-CFUST model, and different strategies for initializing the algorithm are discussed and illustrated. The methodology is implemented in the EMMIXcskew package, and examples are presented using two real datasets. The EMMIXcskew package contains functions to fit the FM-CFUST model, including procedures for generating different initial values. Additional features include random sample generation and contour visualization in 2D and 3D.

Submitted 9 February, 2017; v1 submitted 7 September, 2015; originally announced September 2015.

arXiv:1411.2820 [pdf, other]

Supervised Classification of Flow Cytometric Samples via the Joint Clustering and Matching (JCM) Procedure

Authors: Sharon X. Lee, Geoffrey J. McLachlan, Saumyadipta Pyne

Abstract: We consider the use of the Joint Clustering and Matching (JCM) procedure for the supervised classification of a flow cytometric sample with respect to a number of predefined classes of such samples. The JCM procedure has been proposed as a method for the unsupervised classification of cells within a sample into a number of clusters and in the case of multiple samples, the matching of these clusters across the samples. The two tasks of clustering and matching of the clusters are performed simultaneously within the JCM framework. In this paper, we consider the case where there is a number of distinct classes of samples whose class of origin is known, and the problem is to classify a new sample of unknown class of origin to one of these predefined classes. For example, the different classes might correspond to the types of a particular disease or to the various health outcomes of a patient subsequent to a course of treatment. We show and demonstrate on some real datasets how the JCM procedure can be used to carry out this supervised classification task. A mixture distribution is used to model the distribution of the expressions of a fixed set of markers for each cell in a sample with the components in the mixture model corresponding to the various populations of cells in the composition of the sample. For each class of samples, a class template is formed by the adoption of random-effects terms to model the inter-sample variation within a class. The classification of a new unclassified sample is undertaken by assigning the unclassified sample to the class that minimizes the Kullback-Leibler distance between its fitted mixture density and each class density provided by the class templates.

Submitted 11 November, 2014; originally announced November 2014.

arXiv:1405.0685

Finite Mixtures of Canonical Fundamental Skew t-Distributions

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: This is an extended version of the paper Lee and McLachlan (2014b) with simulations and applications added. This paper introduces a finite mixture of canonical fundamental skew t (CFUST) distributions for a model-based approach to clustering where the clusters are asymmetric and possibly long-tailed (Lee and McLachlan, 2014b). The family of CFUST distributions includes the restricted multivariate skew t (rMST) and unrestricted multivariate skew t (uMST) distributions as special cases. In recent years, a few versions of the multivariate skew t (MST) model have been put forward, together with various EM-type algorithms for parameter estimation. These formulations adopted either a restricted or unrestricted characterization for their MST densities. In this paper, we examine a natural generalization of these developments, employing the CFUST distribution as the parametric family for the component distributions, and point out that the restricted and unrestricted characterizations can be unified under this general formulation. We show that an exact implementation of the EM algorithm can be achieved for the CFUST distribution and mixtures of this distribution, and present some new analytical results for a conditional expectation involved in the E-step.

Submitted 4 May, 2014; originally announced May 2014.

Comments: This is an extended version of the paper Lee and McLachlan (2014b) with simulations and applications added

arXiv:1404.1733

Comment on "Comparing two formulations of skew distributions with special reference to model-based clustering" by A. Azzalini, R. Browne, M. Genton, and P. McNicholas

Authors: Geoffrey J. McLachlan, Sharon X. Lee

Abstract: In this paper, we comment on the recent comparison in Azzalini et al. (2014) of two different distributions proposed in the literature for the modelling of data that have asymmetric and possibly long-tailed clusters. They are referred to as the restricted and unrestricted skew t-distributions by Lee and McLachlan (2013a). Firstly, we wish to point out that in Lee and McLachlan (2014b), which preceded this comparison, it is shown how a distribution belonging to the broader class, the canonical fundamental skew t (CFUST) class, can be fitted with essentially no more computational effort than for the unrestricted distribution. The CFUST class includes the restricted and unrestricted distributions as special cases. Thus the user now has the option of letting the data decide as to which model is appropriate for their particular dataset. Secondly, we wish to identify several statements in the comparison by Azzalini et al. (2014) that demonstrate a serious misunderstanding of the reporting of results in Lee and McLachlan (2014a) on the relative performance of these two skew t-distributions. In particular, there is an apparent misunderstanding of the nomenclature that has been adopted to distinguish between these two models. Thirdly, we take the opportunity to report here that we have obtained improved fits, in some cases a marked improvement, for the unrestricted model for various cases corresponding to different combinations of the variables in the two real datasets that were used in Azzalini et al. (2014) to mount their claims on the relative superiority of the restricted and unrestricted models. For one case the misclassification rate of our fit under the unrestricted model is less than one third of their reported error rate. Our results thus reverse their claims on the ranking of the restricted and unrestricted models in such cases.

Submitted 7 April, 2014; originally announced April 2014.

arXiv:1401.8182

Maximum Likelihood Estimation for Finite Mixtures of Canonical Fundamental Skew t-Distributions: the Unification of the Unrestricted and Restricted Skew t-Mixture Models

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: In this paper, we present an algorithm for the fitting of a location-scale variant of the canonical fundamental skew t (CFUST) distribution, a superclass of the restricted and unrestricted skew t-distributions. In recent years, a few versions of the multivariate skew $t$ (MST) model have been put forward, together with various EM-type algorithms for parameter estimation. These formulations adopted either a restricted or unrestricted characterization for their MST densities. In this paper, we examine a natural generalization of these developments, employing the CFUST distribution as the parametric family for the component distributions, and point out that the restricted and unrestricted characterizations can be unified under this general formulation. We show that an exact implementation of the EM algorithm can be achieved for the CFUST distribution and mixtures of this distribution, and present some new analytical results for a conditional expectation involved in the E-step.

Submitted 31 January, 2014; originally announced January 2014.

arXiv:1310.5336

The skew-t factor analysis model

Authors: Tsung-I Lin, Pal H. Wu, Geoffrey J. McLachlan, Sharon X. Lee

Abstract: Factor analysis is a classical data reduction technique that seeks a potentially lower number of unobserved variables that can account for the correlations among the observed variables. This paper presents an extension of the factor analysis model by assuming jointly a restricted version of the multivariate skew t distribution for the latent factors and unobservable errors, called the skew-t factor analysis model. The proposed model shows robustness to violations of normality assumptions of the underlying latent factors and provides flexibility in capturing extra skewness as well as heavier tails of the observed data. A computationally feasible ECM algorithm is developed for computing maximum likelihood estimates of the parameters. The usefulness of the proposed methodology is illustrated by a real-life example, and the results also demonstrate its better performance over various existing methods.

Submitted 3 December, 2013; v1 submitted 20 October, 2013; originally announced October 2013.

arXiv:1307.7784

Inference on differences between classes using cluster-specific contrasts of mixed effects

Authors: S. K. Ng, G. J. McLachlan, K. Wang, Z. Nagymanyoki, S. Liu, S. -W. Ng

Abstract: The detection of differentially expressed (DE) genes is one of the most commonly studied problems in bioinformatics. For example, the identification of DE genes between distinct disease phenotypes is an important first step in understanding and developing treatment drugs for the disease. It can also contribute significantly to the construction of a discriminant rule for predicting the class of origin of an unclassified tissue sample from a patient. We present a novel approach to the problem of detecting DE genes that is based on a test statistic formed as a weighted (normalized) cluster-specific contrast in the mixed effects of the mixture model used in the first instance to cluster the gene profiles into a manageable number of clusters. The key factor in the formation of our test statistic is the use of gene-specific mixed effects in the cluster-specific contrast. It thus means that the (soft) assignment of a given gene to a cluster is not crucial. This is because in addition to class differences between the (estimated) fixed effects terms for a cluster, gene-specific class differences also contribute to the cluster-specific contributions to the final form of the test statistic. The proposed test statistic can be used where the primary aim is to rank the genes in order of evidence against the null hypothesis of no DE. We also show how a P-value can be calculated for each gene for use in multiple hypothesis testing where the intent is to control the false discovery rate (FDR) at some desired level. With the use of real and simulated data sets, we show that the proposed contrast-based approach outperforms other methods commonly used for the detection of DE genes both in a ranking context with lower proportion of false discoveries and in a multiple hypothesis testing context with higher power for a specified level of the FDR.

Submitted 29 July, 2013; originally announced July 2013.

Comments: 17 pages, 3 figures, 2 tables

arXiv:1307.1748

Extending mixtures of factor models using the restricted multivariate skew-normal distribution

Authors: Tsung-I Lin, Geoffrey J. McLachlan, Sharon X. Lee

Abstract: The mixture of factor analyzers (MFA) model provides a powerful tool for analyzing high-dimensional data as it can reduce the number of free parameters through its factor-analytic representation of the component covariance matrices. This paper extends the MFA model to incorporate a restricted version of the multivariate skew-normal distribution to model the distribution of the latent component factors, called mixtures of skew-normal factor analyzers (MSNFA). The proposed MSNFA model allows us to relax the need for the normality assumption for the latent factors in order to accommodate skewness in the observed data. The MSNFA model thus provides an approach to model-based density estimation and clustering of high-dimensional data exhibiting asymmetric characteristics. A computationally feasible ECM algorithm is developed for computing the maximum likelihood estimates of the parameters. Model selection can be made on the basis of three commonly used information-based criteria. The potential of the proposed methodology is exemplified through applications to two real examples, and the results are compared with those obtained from fitting the MFA model.

Submitted 6 July, 2013; originally announced July 2013.

arXiv:1306.3014

Mixtures of Spatial Spline Regressions

Authors: Hien D. Nguyen, Geoffrey J. McLachlan, Ian A. Wood

Abstract: We present an extension of the functional data analysis framework for univariate functions to the analysis of surfaces: functions of two variables. The spatial spline regression (SSR) approach developed can be used to model surfaces that are sampled over a rectangular domain. Furthermore, combining SSR with linear mixed effects models (LMM) allows for the analysis of populations of surfaces, and combining the joint SSR-LMM method with finite mixture models allows for the analysis of populations of surfaces with sub-family structures. Through the mixtures of spatial spline regressions (MSSR) approach developed, we present methodologies for clustering surfaces into sub-families, and for performing surface-based discriminant analysis. The effectiveness of our methodologies, as well as the modeling capabilities of the SSR model, are assessed through an application to handwritten character recognition.

Submitted 13 June, 2013; v1 submitted 12 June, 2013; originally announced June 2013.

arXiv:1305.7344

doi 10.1371/journal.pone.0100334

Joint Modeling and Registration of Cell Populations in Cohorts of High-Dimensional Flow Cytometric Data

Authors: Saumyadipta Pyne, Kui Wang, Jonathan Irish, Pablo Tamayo, Marc-Danie Nazaire, Tarn Duong, Sharon Lee, Shu-Kay Ng, David Hafler, Ronald Levy, Garry Nolan, Jill Mesirov, Geoffrey J. McLachlan

Abstract: In systems biomedicine, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multi-variable network-level responses. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, without the ability to automatically match them across samples, it is difficult to compare and characterize the populations in typical experiments, such as those responding to various stimulations or distinctive of particular patients or time-points, especially when there are many samples. Joint Clustering and Matching (JCM) is a multi-level framework for simultaneous modeling and registration of populations across a cohort. JCM models every population with a robust multivariate probability distribution. Simultaneously, JCM fits a random-effects model to construct an overall batch template -- used for registering populations across samples, and classifying new samples. By tackling systems-level variation, JCM supports practical biomedical applications involving large cohorts.

Submitted 31 May, 2013; originally announced May 2013.

arXiv:1211.5290

EMMIX-uskew: An R Package for Fitting Mixtures of Multivariate Skew t-distributions via the EM Algorithm

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: This paper describes an algorithm for fitting finite mixtures of unrestricted Multivariate Skew t (FM-uMST) distributions. The package EMMIX-uskew implements a closed-form expectation-maximization (EM) algorithm for computing the maximum likelihood (ML) estimates of the parameters for the (unrestricted) FM-MST model in R. EMMIX-uskew also supports visualization of fitted contours in two and three dimensions, and random sample generation from a specified FM-uMST distribution. Finite mixtures of skew t-distributions have proven to be useful in modelling heterogeneous data with asymmetric and heavy tail behaviour, for example, datasets from flow cytometry. In recent years, various versions of mixtures with multivariate skew t (MST) distributions have been proposed. However, these models adopted some restricted characterizations of the component MST distributions so that the E-step of the EM algorithm can be evaluated in closed form. This paper focuses on mixtures with unrestricted MST components, and describes an iterative algorithm for the computation of the ML estimates of its model parameters. The usefulness of the proposed algorithm is demonstrated in three applications to real data sets. The first example illustrates the use of the main function fmmst in the package by fitting a MST distribution to a bivariate unimodal flow cytometric sample. The second example fits a mixture of MST distributions to the Australian Institute of Sport (AIS) data, and demonstrates that EMMIX-uskew can provide better clustering results than mixtures with restricted MST components. In the third example, EMMIX-uskew is applied to classify cells in a trivariate flow cytometric dataset. Comparisons with other available methods suggest that the EMMIX-uskew result achieved a lower misclassification rate with respect to the labels given by benchmark gating analysis.

Submitted 27 March, 2013; v1 submitted 22 November, 2012; originally announced November 2012.

arXiv:1211.3602

doi 10.1007/s11634-013-0132-8

On Mixtures of Skew Normal and Skew t-Distributions

Authors: Sharon X. Lee, Geoffrey J. McLachlan

Abstract: Finite mixtures of skew distributions have emerged as an effective tool in modelling heterogeneous data with asymmetric features. With various proposals appearing rapidly in recent years, which are similar but not identical, the connections between them and their relative performance have become rather unclear. This paper aims to provide a concise overview of these developments by presenting a systematic classification of the existing skew distributions into four types, thereby clarifying their close relationships. This also aids in understanding the link between some of the proposed expectation-maximization (EM) based algorithms for the computation of the maximum likelihood estimates of the parameters of the models. The final part of this paper presents an illustration of the performance of these mixture models in clustering a real dataset, relative to other non-elliptically contoured clustering methods and associated algorithms for their implementation.

Submitted 28 May, 2013; v1 submitted 15 November, 2012; originally announced November 2012.

Journal ref: Advances in Data Analysis and Classification 2013

arXiv:1109.4764

Clustering of time-course gene expression profiles using normal mixture models with AR(1) random effects

Authors: K. Wang, S. K. Ng, G. J. McLachlan

Abstract: Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expressions have been found inadequate to model the complexity of the time-course data, partly because they ignore the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with AR(1) random effects for the clustering of time-course gene-expression profiles. Some simulations and real examples are given to demonstrate the usefulness of the proposed models.

Submitted 22 September, 2011; originally announced September 2011.

arXiv:1109.4706

On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm

Authors: S. X. Lee, G. J. McLachlan

Abstract: We show how the expectation-maximization (EM) algorithm can be applied exactly for the fitting of mixtures of general multivariate skew t (MST) distributions, eliminating the need for computationally expensive Monte Carlo estimation. Finite mixtures of MST distributions have proven to be useful in modelling heterogeneous data with asymmetric and heavy tail behaviour. Recently, they have been exploited as an effective tool for modelling flow cytometric data. However, without restrictions on the characterizations of the component skew t-distributions, Monte Carlo methods have been used to fit these models. In this paper, we show how the EM algorithm can be implemented for the iterative computation of the maximum likelihood estimates of the model parameters without resorting to Monte Carlo methods for mixtures with unrestricted MST components. The fast calculation of semi-infinite integrals on the E-step of the EM algorithm is effected by noting that they can be put in the form of moments of the truncated multivariate t-distribution, which subsequently can be expressed in terms of the non-truncated form of the t-distribution function for which fast algorithms are available. We demonstrate the usefulness of the proposed methodology through applications to three real data sets.

Submitted 5 September, 2012; v1 submitted 22 September, 2011; originally announced September 2011.

Showing 1–50 of 53 results for author: McLachlan, G J