Outlier-Robust Clustering of Non-Spherical Mixtures

Bakshi, Ainesh; Kothari, Pravesh

arXiv:2005.02970 (cs)

[Submitted on 6 May 2020 (v1), last revised 14 Dec 2020 (this version, v3)]

Abstract: We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated $d$-dimensional Gaussians ($k$-GMMs). Concretely, our algorithm takes as input an $\epsilon$-corrupted sample from a $k$-GMM and, with high probability, in $d^{\text{poly}(k/\eta)}$ time outputs an approximate clustering that misclassifies at most a $k^{O(k)}(\epsilon+\eta)$ fraction of the points whenever every pair of mixture components is separated by $1-\exp(-\text{poly}(k/\eta)^k)$ in total variation (TV) distance. Such a result was not previously known even for $k=2$. TV separation is the statistically weakest possible notion of separation and captures important special cases such as mixed linear regression and subspace clustering.
Our main conceptual contribution is to distill simple analytic properties - (certifiable) hypercontractivity and bounded variance of degree-2 polynomials, and anti-concentration of linear projections - that are necessary and sufficient for mixture models to be (efficiently) clusterable. As a consequence, our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere. Even the information-theoretic clusterability of separated distributions satisfying these two analytic assumptions was not known prior to our work and is likely to be of independent interest.
Our algorithms build on the recent sequence of works relying on certifiable anti-concentration, first introduced in 2019 in the works of Karmalkar, Klivans, and Kothari, and of Raghavendra and Yau. Our techniques expand the sum-of-squares toolkit to show robust certifiability of TV-separated Gaussian clusters in data. This involves giving low-degree sum-of-squares proofs of statements that relate parameter (i.e., mean and covariance) distance to total variation distance by relying only on hypercontractivity and anti-concentration.
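
As a small illustration of why TV separation is weaker than mean separation (the situation in special cases such as subspace clustering, where components can share a mean), the sketch below computes the TV distance between the one-dimensional Gaussians $N(0,1)$ and $N(0,\sigma^2)$ in closed form. It is not taken from the paper: the function name `tv_centered_gaussians` and the one-dimensional setting are assumptions chosen only for illustration.

```python
import math


def tv_centered_gaussians(sigma: float) -> float:
    """TV distance between N(0, 1) and N(0, sigma^2) in one dimension.

    Hypothetical helper for illustration; not part of the paper's algorithm.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if sigma == 1.0:
        return 0.0
    if sigma > 1.0:
        # Rescaling both distributions by 1/sigma preserves TV distance,
        # so the case sigma > 1 reduces to the case 1/sigma < 1.
        sigma = 1.0 / sigma
    # The two densities cross at x = +/- c, where
    # c^2 = -2 sigma^2 ln(sigma) / (1 - sigma^2).
    c = math.sqrt(-2.0 * sigma**2 * math.log(sigma) / (1.0 - sigma**2))
    # On [-c, c] the narrower Gaussian has the larger density, so the TV
    # distance equals the difference in probability mass on that interval.
    return math.erf(c / (sigma * math.sqrt(2.0))) - math.erf(c / math.sqrt(2.0))


if __name__ == "__main__":
    for s in (0.5, 0.1, 0.01, 0.001):
        print(f"sigma={s:<6} TV={tv_centered_gaussians(s):.4f}")
```

Both distributions have mean zero, yet the TV distance tends to 1 as $\sigma \to 0$, so the two components become almost perfectly distinguishable through their covariances alone.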

Comments:	This version fixes a few typos and includes detailed proofs of the certifiable bounded variance property in Section 8 for natural distribution classes (fixing an issue with a generic lemma that proved such a property for a class of distributions in the previous version)
Subjects:	Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Cite as:	arXiv:2005.02970 [cs.DS]
	(or arXiv:2005.02970v3 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2005.02970

Submission history

From: Pravesh K Kothari
[v1] Wed, 6 May 2020 17:24:27 UTC (98 KB)
[v2] Wed, 13 May 2020 17:37:14 UTC (108 KB)
[v3] Mon, 14 Dec 2020 18:00:59 UTC (140 KB)
