A Generic Distributed Clustering Framework for Massive Data

Luo, Pingyi; Huang, Qiang; Tung, Anthony K. H.

arXiv:2106.10515 (cs)

[Submitted on 19 Jun 2021]

Abstract: In this paper, we introduce a novel Generic distributEd clustEring frameworK (GEEK) that goes beyond $k$-means clustering to process massive amounts of data. To deal with different data types, GEEK first converts data in the original feature space into a unified format of buckets; we then design a new Seeding method based on simILar bucKets (SILK) to determine the initial seeds. Unlike state-of-the-art seeding methods such as $k$-means++ and its variants, which require $k$ to be specified in advance, SILK automatically identifies the number of initial seeds based on the closeness of shared data objects in similar buckets. Thus, its time complexity is independent of $k$. With these well-selected initial seeds, GEEK needs only a one-pass data assignment to obtain the final clusters. We implement GEEK on a distributed CPU-GPU platform for large-scale clustering. We evaluate the performance of GEEK on five large-scale real-life datasets and show that GEEK can deal with massive data of different types and is comparable to (or even better than) many state-of-the-art customized GPU-based methods, especially for large values of $k$.
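
The three stages named in the abstract (bucketing, seeding from buckets, one-pass assignment) can be illustrated with a toy sketch. This is not the authors' GEEK or SILK implementation, whose details the abstract does not give. Everything here is an assumption made for illustration: the function names `bucketize`, `seeds_from_buckets`, and `assign_one_pass` are hypothetical, sign random projections stand in for the unspecified bucketing step, and bucket centroids stand in for SILK's seed selection. The sketch only shows how the number of seeds can emerge from the buckets rather than from a pre-specified $k$, and how a single pass over the data then yields the clusters.

```python
import random
from collections import defaultdict


def bucketize(points, num_hashes=4, seed=0):
    """Group point indices into buckets keyed by a sign-random-projection code.

    Assumption: this hashing scheme is a stand-in; the abstract does not say
    how GEEK builds its buckets.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(sum(w * x for w, x in zip(plane, p)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return buckets


def seeds_from_buckets(points, buckets, min_size=2):
    """Use the centroid of every sufficiently large bucket as a seed.

    Assumption: a simple placeholder for SILK. The seed count is determined
    by the buckets, not by a parameter k.
    """
    dim = len(points[0])
    seeds = []
    for members in buckets.values():
        if len(members) >= min_size:
            seeds.append(tuple(
                sum(points[i][d] for i in members) / len(members)
                for d in range(dim)
            ))
    return seeds


def assign_one_pass(points, seeds):
    """Assign each point to its nearest seed (squared Euclidean) in one pass."""
    labels = []
    for p in points:
        nearest = min(
            range(len(seeds)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, seeds[j])),
        )
        labels.append(nearest)
    return labels
```

A call such as `assign_one_pass(points, seeds_from_buckets(points, bucketize(points)))` chains the three stages. Unlike Lloyd-style $k$-means, there is no iterative refinement after the assignment, which is the property the abstract attributes to having well-selected seeds.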

Comments:	11 pages, 7 figures
Subjects:	Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2106.10515 [cs.DB]
	(or arXiv:2106.10515v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2106.10515

Submission history

From: Qiang Huang
[v1] Sat, 19 Jun 2021 15:20:21 UTC (3,289 KB)
