Showing 1–2 of 2 results for author: Melnykov, I
-
K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances
Authors:
Lovis Kwasi Armah,
Igor Melnykov
Abstract:
Effectively applying the K-means algorithm to clustering tasks with incomplete features remains an important research area due to its impact on real-world applications. Recent work has shown that unifying K-means clustering and imputation into one single objective function and solving the resultant optimization yield superior results compared to handling imputation and clustering separately.
In…
▽ More
Effectively applying the K-means algorithm to clustering tasks with incomplete features remains an important research area due to its impact on real-world applications. Recent work has shown that unifying K-means clustering and imputation into one single objective function and solving the resultant optimization yield superior results compared to handling imputation and clustering separately.
In this work, we extend this approach by developing a unified K-means algorithm that incorporates Mahalanobis distances, instead of the traditional Euclidean distances, which previous research has shown to perform better for clusters with elliptical shapes.
We conducted extensive experiments on synthetic datasets containing up to ten elliptical clusters, as well as the IRIS dataset. Using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), we demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means (using either Mahalanobis or Euclidean distance) and K-Means with Incomplete Data, the recent K-means algorithms that integrate imputation and clustering for handling incomplete data. These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
△ Less
Submitted 10 April, 2025; v1 submitted 30 October, 2024;
originally announced November 2024.
-
Long-Tail Theory under Gaussian Mixtures
Authors:
Arman Bolatov,
Maxat Tezekbayev,
Igor Melnykov,
Artur Pak,
Vassilina Nikoulina,
Zhenisbek Assylbekov
Abstract:
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered…
▽ More
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data.
△ Less
Submitted 24 July, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.