-
The quantile-based classifier with variable-wise parameters
Authors:
Marco Berrettini,
Christian Hennig,
Cinzia Viroli
Abstract:
Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters.
Submitted 21 April, 2024;
originally announced April 2024.
-
Dealing with overdispersion in multivariate count data
Authors:
Noemi Corsini,
Cinzia Viroli
Abstract:
The problem of overdispersion in multivariate count data is a challenging issue. Nowadays, it plays a central role mainly due to the relevance of data from modern technologies, such as Next Generation Sequencing and textual data from the web or digital collections. This work presents a comprehensive analysis of the likelihood-based models for extra-variation data proposed in the scientific literature. Particular attention will be paid to the models feasible for high-dimensional data. A new approach together with its parametric-estimation procedure is proposed. It is a deeper version of the Dirichlet-Multinomial distribution and it leads to important results, allowing a better approximation of the observed variability. A meaningful comparison of these models is made through two different simulation studies that both confirm that the new model considered in this work achieves the best results.
Submitted 1 July, 2021;
originally announced July 2021.
-
Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasets
Authors:
Robin Fuchs,
Denys Pommeret,
Cinzia Viroli
Abstract:
Clustering mixed data presents numerous challenges inherent to the very heterogeneous nature of the variables. A clustering algorithm should be able, despite this heterogeneity, to extract discriminant pieces of information from the variables in order to design groups. In this work we introduce a multilayer architecture model-based clustering method called Mixed Deep Gaussian Mixture Model (MDGMM) that can be viewed as an automatic way to merge the clustering performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed as well as to continuous or non-continuous data. In this sense we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best specification of the model and the optimal number of clusters for a given dataset "on the fly". Besides, our model provides continuous low-dimensional representations of the data which can be a useful tool to visualize mixed datasets. Finally, we validate the performance of our approach comparing its results with state-of-the-art mixed data clustering models over several commonly used datasets.
Submitted 10 March, 2021; v1 submitted 13 October, 2020;
originally announced October 2020.
-
Directional quantile classifiers
Authors:
Alessio Farcomeni,
Marco Geraci,
Cinzia Viroli
Abstract:
We introduce classifiers based on directional quantiles. We derive theoretical results for selecting optimal quantile levels given a direction, and, conversely, an optimal direction given a quantile level. We also show that the misclassification rate is infinitesimal if population distributions differ by at most a location shift and if the number of directions is allowed to diverge at the same rate as the problem's dimension. We illustrate the satisfactory performance of our proposed classifiers in both small and high-dimensional settings via a simulation study and a real data example. The code implementing the proposed methods is publicly available in the R package Qtools.
Submitted 11 September, 2020; v1 submitted 10 September, 2020;
originally announced September 2020.
-
Classifying textual data: shallow, deep and ensemble methods
Authors:
Laura Anderlucci,
Lucia Guastadisegni,
Cinzia Viroli
Abstract:
This paper focuses on a comparative evaluation of the most common and modern methods for text classification, including the recent deep learning strategies and ensemble methods. The study is motivated by a challenging real data problem, characterized by high-dimensional and extremely sparse data, deriving from incoming calls to the customer care of an Italian phone company. We will show that deep learning outperforms many classical (shallow) strategies but the combination of shallow and deep learning methods in a unique ensemble classifier may improve the robustness and the accuracy of "single" classification methods.
Submitted 18 February, 2019;
originally announced February 2019.
-
Deep Mixtures of Unigrams for uncovering Topics in Textual Data
Authors:
Cinzia Viroli,
Laura Anderlucci
Abstract:
Mixtures of Unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by Multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight on the grouping structure. In this work, we developed a deep version of mixtures of Unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behaviour of the Deep Mixtures of Unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely $k$-means with cosine distance, $k$-means with Euclidean distance on data transformed according to Semantic Analysis, Partition Around Medoids, Mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward's method with cosine dissimilarity, Latent Dirichlet Allocation, Mixtures of Unigrams estimated via the EM algorithm, Spectral Clustering and Affinity Propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.
Submitted 9 December, 2020; v1 submitted 18 February, 2019;
originally announced February 2019.
-
Quantile-based clustering
Authors:
Christian Hennig,
Cinzia Viroli,
Laura Anderlucci
Abstract:
A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd's algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although $K$-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace-distributions. The consistency of $K$-quantiles clustering is proved, and it is shown that $K$-quantiles clusters correspond to well separated mixture components in a nonparametric mixture. In a simulation, $K$-quantiles clustering is compared with a number of popular clustering methods with good results. A high-dimensional microarray dataset is clustered by $K$-quantiles.
Submitted 8 November, 2019; v1 submitted 27 June, 2018;
originally announced June 2018.
-
Deep Gaussian Mixture Models
Authors:
Cinzia Viroli,
Geoffrey J. McLachlan
Abstract:
Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers.
Submitted 18 November, 2017;
originally announced November 2017.
-
The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015
Authors:
Laura Anderlucci,
Angela Montanari,
Cinzia Viroli
Abstract:
In this paper we retrace the recent history of statistics by analyzing all the papers published in five prestigious statistical journals since 1970, namely: Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, series B and Statistical Science. The aim is to construct a kind of "taxonomy" of the statistical papers by organizing and by clustering them in main themes. In this sense being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of the statistical research. Since the main statistical research topics are naturally born, evolve or die over time, we will also develop a dynamic clustering strategy, where a group in a time period is allowed to migrate or to merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.
Submitted 11 September, 2017;
originally announced September 2017.
-
Infinite Mixtures of Infinite Factor Analysers
Authors:
Keefe Murphy,
Cinzia Viroli,
Isobel Claire Gormley
Abstract:
Factor-analytic Gaussian mixture models are often employed as a model-based approach to clustering high-dimensional data. Typically, the numbers of clusters and latent factors must be specified in advance of model fitting, and remain fixed. The pair which optimises some model selection criterion is then chosen. For computational reasons, models in which the number of latent factors differ across clusters are rarely considered. Here the infinite mixture of infinite factor analysers (IMIFA) model is introduced. IMIFA employs a Pitman-Yor process prior to facilitate automatic inference of the number of clusters using the stick-breaking construction and a slice sampler. Furthermore, IMIFA employs multiplicative gamma process shrinkage priors to allow cluster-specific numbers of factors, automatically inferred via an adaptive Gibbs sampler. IMIFA is presented as the flagship of a family of factor-analytic mixture models, providing flexible approaches to clustering high-dimensional data. Applications to a benchmark data set, metabolomic spectral data, and a manifold learning handwritten digit example illustrate the IMIFA model and its advantageous features. These include obviating the need for model selection criteria, reducing the computational burden associated with the search of the model space, improving clustering performance by allowing cluster-specific numbers of factors, and quantifying uncertainty in the numbers of clusters and cluster-specific factors.
Submitted 13 July, 2021; v1 submitted 24 January, 2017;
originally announced January 2017.
-
Bayesian Smooth-and-Match strategy for ordinary differential equations models that are linear in the parameters
Authors:
Saverio Ranciati,
Cinzia Viroli,
Ernst Wit
Abstract:
In many fields of application, dynamic processes that evolve through time are well described by systems of ordinary differential equations (ODEs). The analytical solution of the ODEs is often not available and different methods have been proposed to infer these quantities: from numerical optimization to regularized (penalized) models, these procedures aim to estimate indirectly the parameters without solving the system. We focus on the class of techniques that use smoothing to avoid direct integration and, in particular, on a Bayesian Smooth-and-Match strategy that makes it possible to obtain the ODEs' solution while performing inference on models that are linear in the parameters. We incorporate in the strategy two main sources of uncertainty: the noise level in the measurements and the model error. We assess the performance of the proposed approach in three different simulation studies and we compare the results on a dataset on neuron electrical activity.
Submitted 18 July, 2017; v1 submitted 8 April, 2016;
originally announced April 2016.
-
Mixture model with multiple allocations for clustering spatially correlated observations in the analysis of ChIP-Seq data
Authors:
Saverio Ranciati,
Cinzia Viroli,
Ernst Wit
Abstract:
Model-based clustering is a technique widely used to group a collection of units into mutually exclusive groups. There are, however, situations in which an observation could in principle belong to more than one cluster. In the context of Next-Generation Sequencing (NGS) experiments, for example, the signal observed in the data might be produced by two (or more) different biological processes operating together and a gene could participate in both (or all) of them. We propose a novel approach to cluster NGS discrete data, coming from a ChIP-Seq experiment, with a mixture model, allowing each unit to belong potentially to more than one group: these multiple allocation clusters can be flexibly defined via a function combining the features of the original groups without introducing new parameters. The formulation naturally gives rise to a `zero-inflation group' in which values close to zero can be allocated, acting as a correction for the abundance of zeros that manifest in this type of data. We take into account the spatial dependency between observations, which is described through a latent Conditional Auto-Regressive process that can reflect different dependency patterns. We assess the performance of our model within a simulation environment and then we apply it to ChIP-seq real data.
Submitted 12 May, 2016; v1 submitted 19 January, 2016;
originally announced January 2016.
-
Modelling overdispersion heterogeneity in differential expression analysis using mixtures
Authors:
Elisabetta Bonafede,
Franck Picard,
Stéphane Robin,
Cinzia Viroli
Abstract:
Next-generation sequencing technologies now constitute a method of choice to measure gene expression. Data to analyze are read counts, commonly modeled using Negative Binomial distributions. A relevant issue associated with this probabilistic framework is the reliable estimation of the overdispersion parameter, reinforced by the limited number of replicates generally observable for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it is the best one in reaching the nominal value for the type-I error, while maintaining high power. The method is finally illustrated on prostate cancer RNA-seq data.
Submitted 7 November, 2014; v1 submitted 23 October, 2014;
originally announced October 2014.
-
Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data
Authors:
Laura Anderlucci,
Cinzia Viroli
Abstract:
We propose a novel approach for modeling multivariate longitudinal data in the presence of unobserved heterogeneity for the analysis of the Health and Retirement Study (HRS) data. Our proposal can be cast within the framework of linear mixed models with discrete individual random intercepts; however, differently from the standard formulation, the proposed Covariance Pattern Mixture Model (CPMM) does not require the usual local independence assumption. The model is thus able to simultaneously model the heterogeneity, the association among the responses and the temporal dependence structure. We focus on the investigation of temporal patterns related to the cognitive functioning in retired American respondents. In particular, we aim to understand whether it can be affected by some individual socio-economic characteristics and whether it is possible to identify some homogeneous groups of respondents that share a similar cognitive profile. An accurate description of the detected groups allows government policy interventions to be appropriately targeted. Results identify three homogeneous clusters of individuals with specific cognitive functioning, consistent with the class conditional distribution of the covariates. The flexibility of CPMM allows for a different contribution of each regressor on the responses according to group membership. In so doing, the identified groups receive a global and accurate phenomenological characterization.
Submitted 16 September, 2015; v1 submitted 7 January, 2014;
originally announced January 2014.
-
Quantile-based classifiers
Authors:
Christian Hennig,
Cinzia Viroli
Abstract:
Quantile classifiers for potentially high-dimensional data are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample.
It is shown that this is consistent, for $n \to \infty$, for the classification rule with asymptotically optimal quantile, and that, under some assumptions, for $p\to\infty$ the probability of correct classification converges to one. The role of skewness of the involved variables is discussed, which leads to an improved classifier.
The optimal quantile classifier performs very well in a comprehensive simulation study and a real data set from chemistry (classification of bioaerosols) compared to nine other classifiers, including the support vector machine and the recently proposed median-based classifier (Hall et al., 2009), which inspired the quantile classifier.
Submitted 12 November, 2013; v1 submitted 6 March, 2013;
originally announced March 2013.
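The classification rule summarised in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the asymmetric check-loss weighting in `check_distance`, the grid of candidate percentages, and all function names are assumptions made for illustration. Each observation is assigned to the class whose within-class quantiles are closest in summed component-wise distance, and the percentage `theta` is picked by minimising the training misclassification error.

```python
import numpy as np

def check_distance(x, q, theta):
    """Asymmetric distance of x to a quantile q at level theta
    (check-loss weighting; an assumed choice of weights)."""
    diff = x - q
    return np.where(diff >= 0, theta * diff, (theta - 1.0) * diff)

def fit_quantiles(X, y, theta):
    """Component-wise theta-quantiles of each class: array of shape (K, p)."""
    classes = np.unique(y)
    Q = np.stack([np.quantile(X[y == c], theta, axis=0) for c in classes])
    return classes, Q

def predict(X, classes, Q, theta):
    """Assign each row of X to the class with the smallest summed distance."""
    # d[i, k] = sum over variables j of the distance from X[i, j] to Q[k, j]
    d = check_distance(X[:, None, :], Q[None, :, :], theta).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def select_theta(X, y, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the percentage that minimises the training misclassification error."""
    best_theta, best_err = None, np.inf
    for theta in grid:
        classes, Q = fit_quantiles(X, y, theta)
        err = np.mean(predict(X, classes, Q, theta) != y)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err
```

A call such as `select_theta(X_train, y_train)` returns the chosen percentage and its training error; in practice a held-out set would be needed to judge the selected rule.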
-
A factor mixture analysis model for multivariate binary data
Authors:
Silvia Cagnone,
Cinzia Viroli
Abstract:
The paper proposes a latent variable model for binary data coming from an unobserved heterogeneous population. The heterogeneity is taken into account by replacing the traditional assumption of Gaussian distributed factors by a finite mixture of multivariate Gaussians. The aim of the proposed model is twofold: it achieves dimension reduction when the data are dichotomous and, simultaneously, it performs model-based clustering in the latent space. Model estimation is obtained by means of a maximum likelihood method via a generalized version of the EM algorithm. In order to evaluate the performance of the model a simulation study and two real applications are illustrated.
Submitted 12 October, 2010;
originally announced October 2010.
-
Stochastic model selection for Mixtures of Matrix-Normals
Authors:
Cinzia Viroli
Abstract:
Finite mixtures of matrix normal distributions are a powerful tool for classifying three-way data in unsupervised problems. The distribution of each component is assumed to be a matrix variate normal density. The mixture model can be estimated through the EM algorithm under the assumption that the number of components is known and fixed. In this work we introduce, develop and explore a Bayesian analysis of the model in order to provide a tool for simultaneous model estimation and model selection. The effectiveness of the proposed method is illustrated on a simulation study and on a real example.
Submitted 6 March, 2013; v1 submitted 12 October, 2010;
originally announced October 2010.