Exact and Approximate Hierarchical Clustering Using A*

Greenberg, Craig S.; Macaluso, Sebastian; Monath, Nicholas; Dubey, Avinava; Flaherty, Patrick; Zaheer, Manzil; Ahmed, Amr; Cranmer, Kyle; McCallum, Andrew

arXiv:2104.07061 (cs)

[Submitted on 14 Apr 2021]

Abstract: Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This combination results in an exact algorithm that scales beyond previous state of the art, from a search space with $10^{12}$ trees to $10^{15}$ trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than $10^{1000}$ trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
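
To give a sense of the search-space sizes quoted in the abstract, the short sketch below counts rooted binary trees over N labeled leaves, which is the standard combinatorial result (2N-3)!!. It assumes the hierarchies being counted are binary trees; the abstract itself does not state this, and the function name num_binary_hierarchies is hypothetical, chosen only for illustration.

```python
def num_binary_hierarchies(n_leaves: int) -> int:
    """Count rooted binary trees over n labeled leaves: (2n - 3)!!.

    Assumption (not stated in the abstract): each hierarchical
    clustering is a rooted binary tree whose leaves are the data points.
    """
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    # For n <= 2 the range is empty and the count stays 1.
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    # Print the count for a few small problem sizes.
    for n in (4, 5, 10, 14):
        print(n, num_binary_hierarchies(n))
```

Under this assumption the count already exceeds $10^{12}$ at 14 leaves, which illustrates why exhaustive enumeration is impractical even for very small datasets.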

Comments:	30 pages, 9 figures
Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Cite as:	arXiv:2104.07061 [cs.LG]
	(or arXiv:2104.07061v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2104.07061

Submission history

From: Sebastian Macaluso [view email]
[v1] Wed, 14 Apr 2021 18:15:27 UTC (193 KB)
