-
New Hard-thresholding Rules based on Data Splitting in High-dimensional Imbalanced Classification
Authors:
Arezou Mojiri,
Abbas Khalili,
Ali Zeinal Hamadani
Abstract:
In binary classification, imbalance refers to situations in which one class is heavily under-represented. This results either from the data collection process or from one class being genuinely rare in the population. Imbalanced classification frequently arises in applications such as biology, medicine, engineering, and social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, due to data scarcity in one class, referred to as the minority class, and the high dimensionality of the feature space, the LDA ignores the minority class, yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules based on a data splitting technique that reduces the large difference between the misclassification rates. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or has comparable performance based on a much smaller subset of selected features, while being computationally more efficient.
Submitted 6 January, 2022; v1 submitted 5 November, 2021;
originally announced November 2021.
-
Estimating the Number of Components in Finite Mixture Models via the Group-Sort-Fuse Procedure
Authors:
Tudor Manole,
Abbas Khalili
Abstract:
Estimation of the number of components (or order) of a finite mixture model is a long-standing and challenging problem in statistics. We propose the Group-Sort-Fuse (GSF) procedure, a new penalized likelihood approach for simultaneous estimation of the order and mixing measure in multidimensional finite mixture models. Unlike methods that fit and compare mixtures with varying orders using criteria involving model complexity, our approach directly penalizes a continuous function of the model parameters. More specifically, given a conservative upper bound on the order, the GSF groups and sorts mixture component parameters to fuse those that are redundant. For a wide range of finite mixture models, we show that the GSF is consistent in estimating the true mixture order and achieves the $n^{-1/2}$ convergence rate for parameter estimation up to polylogarithmic factors. The GSF is implemented for several univariate and multivariate mixture models in the R package GroupSortFuse. Its finite-sample performance is supported by a thorough simulation study, and its application is illustrated on two real data examples.
Submitted 4 August, 2021; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Estimating Sparse Networks with Hubs
Authors:
Annaliza McGillivray,
Abbas Khalili,
David A. Stephens
Abstract:
Graphical modelling techniques based on sparse selection have been applied to infer complex networks in many fields, including biology and medicine, engineering, finance, and social sciences. One structural feature of some of these networks that poses a challenge for statistical inference is the presence of a small number of strongly interconnected nodes, called hubs. For example, in microbiome research, hubs or microbial taxa play a significant role in maintaining stability of the microbial community structure. In this paper, we investigate the problem of estimating sparse networks in which there are a few highly connected hub nodes. Methods based on L1-regularization have been widely used for performing sparse selection in the graphical modelling context. However, while these methods encourage sparsity, they do not take into account structural information about the network. We introduce a new method for estimating networks with hubs that exploits the ability of (inverse) covariance selection methods to include structural information about the underlying network. Our proposed method is a weighted lasso approach with novel row/column sum weights, which we refer to as the hubs weighted graphical lasso. We establish large-sample properties of the method when the number of parameters diverges with the sample size, and evaluate its finite-sample performance via extensive simulations. We illustrate the method with an application to microbiome data.
Submitted 1 March, 2020; v1 submitted 19 April, 2019;
originally announced April 2019.