Document Clustering Evaluation: Divergence from a Random Baseline

De Vries, Christopher M.; Geva, Shlomo; Trotman, Andrew

arXiv:1208.5654 (cs)

[Submitted on 28 Aug 2012 (v1), last revised 29 Aug 2012 (this version, v2)]

Abstract: Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing work, which prevents ineffective clusterings that provide no useful result from receiving high scores. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. Although the approach can be applied to any clustering evaluation, this paper describes its use in the context of document clustering evaluation.
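The idea in the abstract can be illustrated with a minimal sketch. It is not taken from the paper's code. It assumes that the random baseline is built by shuffling the cluster assignments, so the baseline keeps the same cluster sizes as the clustering being evaluated, and it uses purity as a stand-in extrinsic measure. The function names `purity`, `random_baseline` and `divergence` are hypothetical.

```python
import random
from collections import Counter


def purity(clusters, categories):
    """Fraction of documents that share the majority category of their cluster."""
    counts = {}
    for cluster_id, category in zip(clusters, categories):
        counts.setdefault(cluster_id, Counter())[category] += 1
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(clusters)


def random_baseline(clusters, seed=0):
    """Assumed baseline: shuffle the assignments, preserving cluster sizes."""
    shuffled = list(clusters)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def divergence(measure, clusters, categories, seed=0):
    """Score of the clustering minus the score of its random baseline.

    For a measure where lower is better (for example RMSE distortion),
    the sign of the difference would be reversed.
    """
    baseline = random_baseline(clusters, seed)
    return measure(clusters, categories) - measure(baseline, categories)


categories = ["a", "a", "a", "b", "b", "b"]

# An ineffective clustering: every document in its own cluster.
singletons = [0, 1, 2, 3, 4, 5]
print(purity(singletons, categories))              # maximal purity
print(divergence(purity, singletons, categories))  # no gain over random

# A clustering that matches the categories exactly.
matching = [0, 0, 0, 1, 1, 1]
print(divergence(purity, matching, categories))
```

Placing every document in its own cluster earns the maximum purity, yet a random assignment with the same cluster sizes earns it too, so the divergence is zero. That is the sense in which the raw measure is not "performing work" for such a clustering.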

Comments:	8 pages, 11 figures, WIR2012
Subjects:	Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:1208.5654 [cs.IR]
	(or arXiv:1208.5654v2 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1208.5654

Submission history

From: Chris De Vries
[v1] Tue, 28 Aug 2012 13:23:29 UTC (440 KB)
[v2] Wed, 29 Aug 2012 09:04:41 UTC (434 KB)
