Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications

de Vries, Christopher M.; De Vine, Lance; Geva, Shlomo; Nayak, Richi

doi:10.1145/2736277.2741111

arXiv:1505.05613 (cs)

[Submitted on 21 May 2015]

Authors: Christopher M. de Vries, Lance De Vine, Shlomo Geva, Richi Nayak

Abstract: The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages, respectively, and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered a sample, which limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections, where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations that use ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and infeasible.
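
The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea that the title suggests: clustering compressed (bit signature) document representations in a single streaming pass. It assumes documents are already encoded as fixed-width binary signatures held in Python integers, that similarity is measured by Hamming distance, and that cluster representatives are updated by a per-bit majority vote. The function names `hamming` and `streaming_pass` are hypothetical, and a flat list of centroids stands in for the tree structure; none of this is taken from the paper.

```python
from typing import Iterable, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures (assumed distance)."""
    return bin(a ^ b).count("1")


def streaming_pass(
    signatures: Iterable[int], centroids: List[int], bits: int
) -> Tuple[List[int], List[int]]:
    """One pass over a stream of signatures.

    Each signature is assigned to its nearest centroid and folded into
    per-cluster bit counters, so the stream itself never has to be held
    in memory. Returns the updated centroids and the cluster sizes.
    """
    k = len(centroids)
    counts = [0] * k
    bit_sums = [[0] * bits for _ in range(k)]

    for sig in signatures:
        # Assignment step: nearest centroid by Hamming distance.
        nearest = min(range(k), key=lambda i: hamming(sig, centroids[i]))
        counts[nearest] += 1
        for b in range(bits):
            if (sig >> b) & 1:
                bit_sums[nearest][b] += 1

    # Update step: a bit is set in the new centroid if it is set in
    # more than half of the signatures assigned to that cluster.
    new_centroids = []
    for i in range(k):
        if counts[i] == 0:
            new_centroids.append(centroids[i])  # keep empty clusters as-is
            continue
        c = 0
        for b in range(bits):
            if 2 * bit_sums[i][b] > counts[i]:
                c |= 1 << b
        new_centroids.append(c)
    return new_centroids, counts
```

Because only the counters and centroids are kept between signatures, memory use in this sketch depends on the number of clusters and the signature width, not on the size of the collection. Repeating the pass with the returned centroids gives a k-means-style iteration over the stream.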

Comments:	11 pages, WWW 2015
Subjects:	Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes:	H.3.3; I.5.3; D.1.3
Cite as:	arXiv:1505.05613 [cs.IR]
	(or arXiv:1505.05613v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1505.05613
Related DOI:	https://doi.org/10.1145/2736277.2741111

Submission history

From: Chris De Vries
[v1] Thu, 21 May 2015 06:22:04 UTC (510 KB)
