Showing 1–50 of 53 results for author: McLachlan, G J

  1. arXiv:2302.13206  [pdf, other]

    stat.CO

    Semi-supervised Gaussian mixture modelling with a missing-data mechanism in R

    Authors: Ziyang Lyu, Daniel Ahfock, Ryan Thompson, Geoffrey J. McLachlan

    Abstract: Semi-supervised learning is being extensively applied to estimate classifiers from training data in which not all the labels of the feature vectors are available. We present gmmsslm, an R package for estimating the Bayes' classifier from such partially classified data in the case where the feature vector has a multivariate Gaussian (normal) distribution in each of the predefined classes. Our packa…

    Submitted 16 April, 2024; v1 submitted 25 February, 2023; originally announced February 2023.

    Comments: To appear in the Australian and New Zealand Journal of Statistics

  2. arXiv:2202.02249  [pdf, other]

    stat.ME cs.LG stat.CO stat.ML

    Functional Mixtures-of-Experts

    Authors: Faïcel Chamroukhi, Nhat Thien Pham, Van Hà Hoang, Geoffrey J. McLachlan

    Abstract: We consider the statistical analysis of heterogeneous data for prediction in situations where the observations include functions, typically time series. We extend the modeling with Mixtures-of-Experts (ME), as a framework of choice in modeling heterogeneity in data for prediction with vectorial observations, to this functional data analysis context. We first present a new family of ME models, name…

    Submitted 20 December, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

    MSC Class: 62-XX; 62R10 ACM Class: G.3

  3. arXiv:2104.04046  [pdf, other]

    stat.ML cs.LG

    Semi-Supervised Learning of Classifiers from a Statistical Perspective: A Brief Review

    Authors: Daniel Ahfock, Geoffrey J. McLachlan

    Abstract: There has been increasing attention to semi-supervised learning (SSL) approaches in machine learning to forming a classifier in situations where the training data for a classifier consists of a limited number of classified observations but a much larger number of unclassified observations. This is because the procurement of classified data can be quite costly due to high acquisition costs and subs…

    Submitted 9 November, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

  4. arXiv:2104.02888  [pdf, other]

    stat.ME

    Data-fusion using factor analysis and low-rank matrix completion

    Authors: Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

    Abstract: Data-fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only strict subsets of marginal distributions have been observed. Estimation of the covariance matrix of the full set of variables is challenging…

    Submitted 6 April, 2021; originally announced April 2021.

  5. arXiv:2104.02872  [pdf, other]

    stat.ML cs.LG

    Harmless label noise and informative soft-labels in supervised classification

    Authors: Daniel Ahfock, Geoffrey J. McLachlan

    Abstract: Manual labelling of training examples is common practice in supervised learning. When the labelling task is of non-trivial difficulty, the supplied labels may not be equal to the ground-truth labels, and label noise is introduced into the training dataset. If the manual annotation is carried out by multiple experts, the same training example can be given different class assignments by different ex…

    Submitted 6 April, 2021; originally announced April 2021.

  6. arXiv:2009.10622  [pdf, other]

    math.ST cs.AI cs.LG stat.ME stat.ML

    Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts

    Authors: TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, Geoffrey J McLachlan

    Abstract: We investigate the estimation properties of the mixture of experts (MoE) model in a high-dimensional setting, where the number of predictors is much larger than the sample size, and for which the literature is particularly lacking in theoretical results. We consider the class of softmax-gated Gaussian MoE (SGMoE) models, defined as MoE models with softmax gating functions and Gaussian experts, and…

    Submitted 2 July, 2024; v1 submitted 22 September, 2020; originally announced September 2020.

    Comments: Revise and add numerical experiments

    MSC Class: 62E17 (Primary) 62H12; 62H30 (Secondary)

  7. arXiv:2005.06848  [pdf, ps, other]

    stat.CO

    Multi-Node EM Algorithm for Finite Mixture Models

    Authors: Sharon X. Lee, Geoffrey J. McLachlan, Kaleb L. Leemaqz

    Abstract: Finite mixture models are powerful tools for modelling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation-Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions t…

    Submitted 14 May, 2020; originally announced May 2020.

    Comments: 12 pages, 1 figure

  8. arXiv:2004.06237  [pdf, other]

    stat.ML cs.LG

    Estimation of Classification Rules from Partially Classified Data

    Authors: Geoffrey J. McLachlan, Daniel Ahfock

    Abstract: We consider the situation where the observed sample contains some observations whose class of origin is known (that is, they are classified with respect to the g underlying classes of interest), and where the remaining observations in the sample are unclassified (that is, their class labels are unknown). For class-conditional distributions taken to be known up to a vector of unknown parameters, th…

    Submitted 13 April, 2020; originally announced April 2020.

    Comments: Based on invited talk given to the 16th Conference of the International Federation of Classification Societies in Thessaloniki, August 2019

  9. arXiv:1910.09189  [pdf, other]

    stat.ME

    An Apparent Paradox: A Classifier Trained from a Partially Classified Sample May Have Smaller Expected Error Rate Than That If the Sample Were Completely Classified

    Authors: Daniel Ahfock, Geoffrey J. McLachlan

    Abstract: There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has known class label. Hence assuming that the labels of the unclassified features are randomly missing or their miss…

    Submitted 6 November, 2019; v1 submitted 21 October, 2019; originally announced October 2019.

  10. arXiv:1904.12057  [pdf, ps, other]

    stat.ME

    Comment on "Hidden truncation hyperbolic distributions, finite mixtures thereof and their application for clustering" Murray, Browne, and McNicholas

    Authors: Geoffrey J. McLachlan, Sharon X. Lee

    Abstract: We comment on the paper of Murray, Browne, and McNicholas (2017), who proposed mixtures of skew distributions, which they termed hidden truncation hyperbolic (HTH). They recently made a clarification (Murray, Browne, McNicholas, 2019) concerning their claim that the so-called CFUST distribution is a special case of the HTH distribution. There are also some other matters in the original version of…

    Submitted 26 April, 2019; originally announced April 2019.

    Comments: 7 pages

  11. arXiv:1904.02883  [pdf, other]

    stat.ME math.ST

    On missing label patterns in semi-supervised learning

    Authors: Daniel Ahfock, Geoffrey J. McLachlan

    Abstract: We investigate model based classification with partially labelled training data. In many biostatistical applications, labels are manually assigned by experts, who may leave some observations unlabelled due to class uncertainty. We analyse semi-supervised learning as a missing data problem and identify situations where the missing label pattern is non-ignorable for the purposes of maximum likelihoo…

    Submitted 5 April, 2019; originally announced April 2019.

  12. arXiv:1903.12342  [pdf, other]

    stat.ME

    Statistical matching of non-Gaussian data

    Authors: Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

    Abstract: The statistical matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets that only have a strict subset of variables jointly observed across all datasets. The simplest version involves two datasets, labelled A and B, with three variables of interest $X, Y$ and $Z$. Variables $X$ and $Y$ are observed in dataset A and va…

    Submitted 28 March, 2019; originally announced March 2019.

  13. arXiv:1902.03335  [pdf, other]

    stat.CO

    Mini-batch learning of exponential family finite mixture models

    Authors: H D Nguyen, F Forbes, G J McLachlan

    Abstract: Mini-batch algorithms have become increasingly popular due to the requirement for solving optimization problems, based on large-scale data sets. Using an existing online expectation-maximization (EM) algorithm framework, we demonstrate how mini-batch (MB) algorithms may be constructed, and propose a scheme for the stochastic stabilization of the constructed mini-batch algorithms. Theoretical re…

    Submitted 5 September, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

  14. arXiv:1810.04842  [pdf, ps, other]

    stat.ME

    On formulations of skew factor models: skew errors versus skew factors

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: In the past few years, there have been a number of proposals for generalizing the factor analysis (FA) model and its mixture version (known as mixtures of factor analyzers (MFA)) using non-normal and asymmetric distributions. These models adopt various types of skew densities for either the factors or the errors. While the relationships between various choices of skew distributions have been discu…

    Submitted 20 November, 2018; v1 submitted 11 October, 2018; originally announced October 2018.

  15. arXiv:1805.04394  [pdf, other]

    stat.ME

    False discovery rate control under reduced precision computation for analysis of neuroimaging data

    Authors: Hien D. Nguyen, Yohan Yee, Geoffrey J. McLachlan, Jason P. Lerch

    Abstract: The mitigation of false positives is an important issue when conducting multiple hypothesis testing. The most popular paradigm for false positives mitigation in high-dimensional applications is via the control of the false discovery rate (FDR). Multiple testing data from neuroimaging experiments can be very large, and reduced precision storage of such data is often required. Reduced precision comp…

    Submitted 16 July, 2018; v1 submitted 11 May, 2018; originally announced May 2018.

  16. arXiv:1804.08365  [pdf, other]

    stat.CO

    Positive data kernel density estimation via the logKDE package for R

    Authors: Andrew T. Jones, Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: Kernel density estimators (KDEs) are ubiquitous tools for nonparametric estimation of probability density functions (PDFs), when data are obtained from unknown data generating processes. The KDEs that are typically available in software packages are defined, and designed, to estimate real-valued data. When applied to positive data, these typical KDEs do not yield bona fide PDFs. A log-transformati…

    Submitted 5 August, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

  17. arXiv:1804.08341  [pdf, other]

    stat.CO

    Randomized Mixture Models for Probability Density Approximation and Estimation

    Authors: Hien D. Nguyen, Dianhui Wang, Geoffrey J. McLachlan

    Abstract: Randomized neural networks (NNs) are an interesting alternative to conventional NNs that are more used for data modeling. The random vector functional-link (RVFL) network is an established and theoretically well-grounded randomized learning model. A key theoretical result for RVFL networks is that they provide universal approximation for continuous maps, on average, almost surely. We specialize an…

    Submitted 23 April, 2018; originally announced April 2018.

  18. arXiv:1802.02467  [pdf, other]

    stat.ME

    Mixtures of Factor Analyzers with Fundamental Skew Symmetric Distributions

    Authors: Sharon X. Lee, Tsung-I Lin, Geoffrey J. McLachlan

    Abstract: Mixtures of factor analyzers (MFA) provide a powerful tool for modelling high-dimensional datasets. In recent years, several generalizations of MFA have been developed where the normality assumption of the factors and/or of the errors was relaxed to allow for skewness in the data. However, due to the form of the adopted component densities, the distribution of the factors/errors in most of these m…

    Submitted 26 October, 2018; v1 submitted 7 February, 2018; originally announced February 2018.

  19. arXiv:1711.06929  [pdf, ps, other]

    stat.ML cs.LG

    Deep Gaussian Mixture Models

    Authors: Cinzia Viroli, Geoffrey J. McLachlan

    Abstract: Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Th…

    Submitted 18 November, 2017; originally announced November 2017.

    Comments: 19 pages, 4 figures

  20. arXiv:1705.04651  [pdf, other]

    stat.CO cs.LG stat.ML

    Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization–Minimization Algorithm Approach

    Authors: Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: Support vector machines (SVMs) are an important tool in modern data analysis. Traditionally, support vector machines have been fitted via quadratic programming, either using purpose-built or off-the-shelf algorithms. We present an alternative approach to SVM fitting via the majorization–minimization (MM) paradigm. Algorithms that are derived via MM algorithm constructions can be shown to monotoni…

    Submitted 12 May, 2017; originally announced May 2017.

  21. arXiv:1701.04512  [pdf, other]

    stat.ME

    Some Theoretical Results Regarding the Polygonal Distribution

    Authors: Hien D Nguyen, Geoffrey J McLachlan

    Abstract: The polygonal distributions are a class of distributions that can be defined via the mixture of triangular distributions over the unit interval. The class includes the uniform and trapezoidal distributions, and is an alternative to the beta distribution. We demonstrate that the polygonal densities are dense in the class of continuous and concave densities with bounded second derivatives. Pointwise…

    Submitted 16 January, 2017; originally announced January 2017.

  22. arXiv:1612.06492  [pdf, other]

    stat.CO

    Chunked-and-Averaged Estimators for Vector Parameters

    Authors: Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: A divide-and-conquer method for parameter estimation is the chunked-and-averaged (CA) estimator. CA estimators have been studied for univariate parameters under independent and identically distributed (IID) sampling. We study the CA estimators of vector parameters and under non-IID sampling.

    Submitted 29 August, 2017; v1 submitted 19 December, 2016; originally announced December 2016.

  23. arXiv:1611.03974  [pdf, other]

    stat.OT math.ST

    On approximations via convolution-defined mixture models

    Authors: Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: An often-cited fact regarding mixing or mixture distributions is that their density functions are able to approximate the density function of any unknown distribution to arbitrary degrees of accuracy, provided that the mixing or mixture distribution is sufficiently complex. This fact is often not made concrete. We investigate and review theorems that provide approximation bounds for mixing distrib…

    Submitted 1 March, 2018; v1 submitted 12 November, 2016; originally announced November 2016.

  24. arXiv:1611.01602  [pdf, other]

    stat.ME

    Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling

    Authors: Hien D. Nguyen, Jeremy F. P. Ullmann, Geoffrey J. McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M. C. Hillman, David C. Reutens, Andrew L. Janke

    Abstract: Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology via Gaussian mixtures is suggested for the clustering…

    Submitted 28 February, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

  25. arXiv:1608.05481  [pdf, other]

    stat.CO

    Faster Functional Clustering via Gaussian Mixture Models

    Authors: Hien D Nguyen, Geoffrey J McLachlan, Jeremy F P Ullmann, Andrew L Janke

    Abstract: Functional data analysis (FDA) is an important modern paradigm for handling infinite-dimensional data. An important task in FDA is model-based clustering, which organizes functional populations into groups via subpopulation structures. The most common approach for model-based clustering of functional data is via mixtures of linear mixed-effects models. The mixture of linear mixed-effects models (M…

    Submitted 12 February, 2017; v1 submitted 18 August, 2016; originally announced August 2016.

  26. arXiv:1608.02797  [pdf, other]

    stat.CO cs.DC

    A block EM algorithm for multivariate skew normal and skew t-mixture models

    Authors: Sharon X Lee, Kaleb L Leemaqz, Geoffrey J McLachlan

    Abstract: Finite mixtures of skew distributions provide a flexible tool for modelling heterogeneous data with asymmetric distributional features. However, parameter estimation via the Expectation-Maximization (EM) algorithm can become very time-consuming due to the complicated expressions involved in the E-step that are numerically expensive to evaluate. A more time-efficient implementation of the EM algori…

    Submitted 9 August, 2016; originally announced August 2016.

  27. arXiv:1607.04807  [pdf, other]

    stat.OT math.ST

    Progress on a Conjecture Regarding the Triangular Distribution

    Authors: Hien D Nguyen, Geoffrey J McLachlan

    Abstract: Triangular distributions are a well-known class of distributions that are often used as an elementary example of a probability model. Maximum likelihood estimation of the mode parameter of the triangular distribution over the unit interval can be performed via an order statistics-based method. It had been conjectured that such a method can be conducted using only a constant number of likelihood fu…

    Submitted 5 November, 2016; v1 submitted 16 July, 2016; originally announced July 2016.

  28. arXiv:1606.02054  [pdf, other]

    stat.CO

    A simple multithreaded implementation of the EM algorithm for mixture models

    Authors: Sharon X Lee, Kaleb L Lee, Geoffrey J McLachlan

    Abstract: Finite mixture models have been widely used for the modelling and analysis of data from heterogeneous populations. Maximum likelihood estimation of the parameters is typically carried out via the Expectation-Maximization (EM) algorithm. The complexity of the implementation of the algorithm depends on the parametric distribution that is adopted as the component densities of the mixture model. In th…

    Submitted 7 June, 2016; originally announced June 2016.

  29. arXiv:1603.08326  [pdf, other]

    stat.AP

    A globally convergent algorithm for lasso-penalized mixture of linear regression models

    Authors: Luke R. Lloyd-Jones, Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: Variable selection is an old and pervasive problem in regression analysis. One solution is to impose a lasso penalty to shrink parameter estimates toward zero and perform continuous model selection. The lasso-penalized mixture of linear regressions model (L-MLR) is a class of regularization methods for the model selection problem in the fixed number of variables setting. In this article, we propos…

    Submitted 2 May, 2016; v1 submitted 28 March, 2016; originally announced March 2016.

    Comments: 38 pages, 4 tables, 2 figures

  30. A Block Minorization–Maximization Algorithm for Heteroscedastic Regression

    Authors: Hien D. Nguyen, Luke R. Lloyd-Jones, Geoffrey J. McLachlan

    Abstract: The computation of the maximum likelihood (ML) estimator for heteroscedastic regression models is considered. The traditional Newton algorithms for the problem require matrix multiplications and inversions, which are bottlenecks in modern Big Data contexts. A new Big Data-appropriate minorization–maximization (MM) algorithm is considered for the computation of the ML estimator. The MM algorithm i…

    Submitted 30 May, 2016; v1 submitted 15 March, 2016; originally announced March 2016.

  31. arXiv:1602.08787  [pdf, other]

    stat.CO stat.ME

    Maximum Pseudolikelihood Estimation for Model-Based Clustering of Time Series Data

    Authors: Hien D Nguyen, Geoffrey J McLachlan, Pierre Orban, Pierre Bellec, Andrew L Janke

    Abstract: Mixture of autoregressions (MoAR) models provide a model-based approach to the clustering of time series data. The maximum likelihood (ML) estimation of MoAR models requires the evaluation of products of large numbers of densities of normal random variables. In practical scenarios, these products converge to zero as the length of the time series increases, and thus the ML estimation of MoAR models…

    Submitted 17 October, 2016; v1 submitted 28 February, 2016; originally announced February 2016.

  32. arXiv:1602.03697  [pdf, other]

    stat.ME

    Linear Mixed Models with Marginally Symmetric Nonparametric Random Effects

    Authors: Hien D. Nguyen, Geoffrey J. McLachlan

    Abstract: Linear mixed models (LMMs) are used as an important tool in the data analysis of repeated measures and longitudinal studies. The most common form of LMMs utilize a normal distribution to model the random effects. Such assumptions can often lead to misspecification errors when the random effects are not normal. One approach to remedy the misspecification errors is to utilize a point-mass distributi…

    Submitted 13 February, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

  33. arXiv:1602.03692  [pdf, other]

    stat.CO

    Maximum Likelihood Estimation of Triangular and Polygonal Distributions

    Authors: Hien D Nguyen, Geoffrey J McLachlan

    Abstract: Triangular distributions are a well-known class of distributions that are often used as an elementary example of a probability model. In the past, enumeration and order statistic-based methods have been suggested for the maximum likelihood (ML) estimation of such distributions. A novel parametrization of triangular distributions is presented. The parametrization allows for the construction of an MM (…

    Submitted 13 February, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

  34. arXiv:1602.03683  [pdf, other]

    stat.ML

    A Universal Approximation Theorem for Mixture of Experts Models

    Authors: Hien D Nguyen, Luke R Lloyd-Jones, Geoffrey J McLachlan

    Abstract: The mixture of experts (MoE) model is a popular neural network architecture for nonlinear regression and classification. The class of MoE mean functions is known to be uniformly convergent to any unknown target function, assuming that the target function is from Sobolev space that is sufficiently differentiable and that the domain of estimation is a compact unit hypercube. We provide an alternativ…

    Submitted 11 February, 2016; originally announced February 2016.

  35. arXiv:1601.03517  [pdf, other]

    stat.ME

    Spatial Clustering of Time-Series via Mixture of Autoregressions Models and Markov Random Fields

    Authors: Hien D Nguyen, Geoffrey J McLachlan, Jeremy F P Ullmann, Andrew L Janke

    Abstract: Time-series data arise in many medical and biological imaging scenarios. In such images, a time-series is obtained at each of a large number of spatially-dependent data units. It is interesting to organize these data into model-based clusters. A two-stage procedure is proposed. In Stage 1, a mixture of autoregressions (MoAR) model is used to marginally cluster the data. The MoAR model is fitted us…

    Submitted 14 January, 2016; originally announced January 2016.

  36. arXiv:1601.00773  [pdf, other]

    math.ST stat.CO

    Comment on "On Nomenclature, and the Relative Merits of Two Formulations of Skew Distributions" by A. Azzalini, R. Browne, M. Genton, and P. McNicholas

    Authors: Geoffrey J. McLachlan, Sharon X. Lee

    Abstract: We comment on the recent paper by Azzalini et al. (2015) on two different distributions proposed in the literature for the modelling of data that have asymmetric and possibly long-tailed clusters. They are referred to as the restricted and unrestricted skew normal and skew t-distributions by Lee and McLachlan (2013a). We clarify an apparent misunderstanding in Azzalini et al. (2015) of this nomencl…

    Submitted 5 January, 2016; originally announced January 2016.

  37. arXiv:1509.02069  [pdf, other]

    stat.CO stat.ME

    EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: This paper presents an R package EMMIXcskew for the fitting of the canonical fundamental skew t-distribution (CFUST) and finite mixtures of this distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution provides a flexible family of models to handle non-normal data, with parameters for capturing skewness and heavy-tails in the data. It formally encompasses the normal, t, and skew…

    Submitted 9 February, 2017; v1 submitted 7 September, 2015; originally announced September 2015.

  38. arXiv:1411.2820  [pdf, other]

    q-bio.QM stat.ME stat.ML

    Supervised Classification of Flow Cytometric Samples via the Joint Clustering and Matching (JCM) Procedure

    Authors: Sharon X. Lee, Geoffrey J. McLachlan, Saumyadipta Pyne

    Abstract: We consider the use of the Joint Clustering and Matching (JCM) procedure for the supervised classification of a flow cytometric sample with respect to a number of predefined classes of such samples. The JCM procedure has been proposed as a method for the unsupervised classification of cells within a sample into a number of clusters and in the case of multiple samples, the matching of these cluster…

    Submitted 11 November, 2014; originally announced November 2014.

  39. arXiv:1405.0685  [pdf, other]

    stat.ME

    Finite Mixtures of Canonical Fundamental Skew t-Distributions

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: This is an extended version of the paper Lee and McLachlan (2014b) with simulations and applications added. This paper introduces a finite mixture of canonical fundamental skew t (CFUST) distributions for a model-based approach to clustering where the clusters are asymmetric and possibly long-tailed (Lee and McLachlan, 2014b). The family of CFUST distributions includes the restricted multivariate…

    Submitted 4 May, 2014; originally announced May 2014.

    Comments: This is an extended version of the paper Lee and McLachlan (2014b) with simulations and applications added

  40. arXiv:1404.1733  [pdf, other]

    stat.ME

    Comment on "Comparing two formulations of skew distributions with special reference to model-based clustering" by A. Azzalini, R. Browne, M. Genton, and P. McNicholas

    Authors: Geoffrey J. McLachlan, Sharon X. Lee

    Abstract: In this paper, we comment on the recent comparison in Azzalini et al. (2014) of two different distributions proposed in the literature for the modelling of data that have asymmetric and possibly long-tailed clusters. They are referred to as the restricted and unrestricted skew t-distributions by Lee and McLachlan (2013a). Firstly, we wish to point out that in Lee and McLachlan (2014b), which prece…

    Submitted 7 April, 2014; originally announced April 2014.

  41. arXiv:1401.8182  [pdf, other]

    stat.ME

    Maximum Likelihood Estimation for Finite Mixtures of Canonical Fundamental Skew t-Distributions: the Unification of the Unrestricted and Restricted Skew t-Mixture Models

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: In this paper, we present an algorithm for the fitting of a location-scale variant of the canonical fundamental skew t (CFUST) distribution, a superclass of the restricted and unrestricted skew t-distributions. In recent years, a few versions of the multivariate skew $t$ (MST) model have been put forward, together with various EM-type algorithms for parameter estimation. These formulations adopted…

    Submitted 31 January, 2014; originally announced January 2014.

  42. arXiv:1310.5336  [pdf, other]

    stat.ME

    The skew-t factor analysis model

    Authors: Tsung-I Lin, Pal H. Wu, Geoffrey J. McLachlan, Sharon X. Lee

    Abstract: Factor analysis is a classical data reduction technique that seeks a potentially lower number of unobserved variables that can account for the correlations among the observed variables. This paper presents an extension of the factor analysis model by assuming jointly a restricted version of multivariate skew t distribution for the latent factors and unobservable errors, called the skew-t factor an…

    Submitted 3 December, 2013; v1 submitted 20 October, 2013; originally announced October 2013.

  43. arXiv:1307.7784  [pdf, other]

    stat.ME

    Inference on differences between classes using cluster-specific contrasts of mixed effects

    Authors: S. K. Ng, G. J. McLachlan, K. Wang, Z. Nagymanyoki, S. Liu, S. -W. Ng

    Abstract: The detection of differentially expressed (DE) genes is one of the most commonly studied problems in bioinformatics. For example, the identification of DE genes between distinct disease phenotypes is an important first step in understanding and developing treatment drugs for the disease. It can also contribute significantly to the construction of a discriminant rule for predicting the class of ori…

    Submitted 29 July, 2013; originally announced July 2013.

    Comments: 17 pages, 3 figures, 2 tables

  44. arXiv:1307.1748  [pdf, other]

    stat.ME

    Extending mixtures of factor models using the restricted multivariate skew-normal distribution

    Authors: Tsung-I Lin, Geoffrey J. McLachlan, Sharon X. Lee

    Abstract: The mixture of factor analyzers (MFA) model provides a powerful tool for analyzing high-dimensional data as it can reduce the number of free parameters through its factor-analytic representation of the component covariance matrices. This paper extends the MFA model to incorporate a restricted version of the multivariate skew-normal distribution to model the distribution of the latent component fac…

    Submitted 6 July, 2013; originally announced July 2013.

  45. arXiv:1306.3014  [pdf, other]

    stat.ME stat.CO

    Mixtures of Spatial Spline Regressions

    Authors: Hien D. Nguyen, Geoffrey J. McLachlan, Ian A. Wood

    Abstract: We present an extension of the functional data analysis framework for univariate functions to the analysis of surfaces: functions of two variables. The spatial spline regression (SSR) approach developed can be used to model surfaces that are sampled over a rectangular domain. Furthermore, combining SSR with linear mixed effects models (LMM) allows for the analysis of populations of surfaces, and c…

    Submitted 13 June, 2013; v1 submitted 12 June, 2013; originally announced June 2013.

  46. Joint Modeling and Registration of Cell Populations in Cohorts of High-Dimensional Flow Cytometric Data

    Authors: Saumyadipta Pyne, Kui Wang, Jonathan Irish, Pablo Tamayo, Marc-Danie Nazaire, Tarn Duong, Sharon Lee, Shu-Kay Ng, David Hafler, Ronald Levy, Garry Nolan, Jill Mesirov, Geoffrey J. McLachlan

    Abstract: In systems biomedicine, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multi-variable network-level responses. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, withou…

    Submitted 31 May, 2013; originally announced May 2013.

  47. arXiv:1211.5290  [pdf, ps, other]

    stat.CO stat.ME

    EMMIX-uskew: An R Package for Fitting Mixtures of Multivariate Skew t-distributions via the EM Algorithm

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: This paper describes an algorithm for fitting finite mixtures of unrestricted Multivariate Skew t (FM-uMST) distributions. The package EMMIX-uskew implements a closed-form expectation-maximization (EM) algorithm for computing the maximum likelihood (ML) estimates of the parameters for the (unrestricted) FM-MST model in R. EMMIX-uskew also supports visualization of fitted contours in two and three…

    Submitted 27 March, 2013; v1 submitted 22 November, 2012; originally announced November 2012.

  48. On Mixtures of Skew Normal and Skew t-Distributions

    Authors: Sharon X. Lee, Geoffrey J. McLachlan

    Abstract: Finite mixtures of skew distributions have emerged as an effective tool in modelling heterogeneous data with asymmetric features. With various proposals appearing rapidly in recent years, which are similar but not identical, the connections between them and their relative performance have become rather unclear. This paper aims to provide a concise overview of these developments by presenting a syst…

    Submitted 28 May, 2013; v1 submitted 15 November, 2012; originally announced November 2012.

    Journal ref: Advances in Data Analysis and Classification 2013

  49. arXiv:1109.4764  [pdf, ps, other]

    stat.ME

    Clustering of time-course gene expression profiles using normal mixture models with AR(1) random effects

    Authors: K. Wang, S. K. Ng, G. J. McLachlan

    Abstract: Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expressions have been found not to be sufficiently adequate to model the complexity of the time-course data, partly due to their ignoring the dependence between the expression measurements over time and the correlation amo…

    Submitted 22 September, 2011; originally announced September 2011.

  50. arXiv:1109.4706  [pdf, ps, other]

    stat.ME

    On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm

    Authors: S. X. Lee, G. J. McLachlan

    Abstract: We show how the expectation-maximization (EM) algorithm can be applied exactly for the fitting of mixtures of general multivariate skew t (MST) distributions, eliminating the need for computationally expensive Monte Carlo estimation. Finite mixtures of MST distributions have proven to be useful in modelling heterogeneous data with asymmetric and heavy tail behaviour. Recently, they have been explo…

    Submitted 5 September, 2012; v1 submitted 22 September, 2011; originally announced September 2011.