Blaze: Simplified High Performance Cluster Computing

Li, Junhao; Zhang, Hang

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:1902.01437 (cs)

[Submitted on 4 Feb 2019 (v1), last revised 6 Feb 2019 (this version, v2)]

Abstract: MapReduce and its variants have significantly simplified and accelerated the development of parallel programs. However, most MapReduce implementations focus on data-intensive tasks, while many real-world tasks are compute-intensive and their data fits in the distributed memory of the cluster. For these tasks, MapReduce programs can be much slower than hand-optimized ones. We present Blaze, a C++ library that makes it easy to develop high-performance parallel programs for such compute-intensive tasks. At the core of Blaze is a highly optimized in-memory MapReduce function, which has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range. We also offer additional conveniences that make developing parallel programs similar to developing serial programs. These improvements make Blaze an easy-to-use cluster computing library that approaches the speed of hand-optimized parallel code. We apply Blaze to some common data mining tasks, including word frequency count, PageRank, k-means, expectation maximization (Gaussian mixture model), and k-nearest neighbors. Blaze outperforms Apache Spark by more than 10 times on average for these tasks, and the speed of Blaze scales almost linearly with the number of nodes. In addition, the Blaze implementations of these tasks use only the MapReduce function and 3 utility functions, while the official Spark implementations use almost 30 different parallel primitives.
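
Two of the improvements named in the abstract, eager reduction and special treatment for a small fixed key range, can be illustrated with a minimal single-process sketch. This is not Blaze's API: the function names `eager_word_count` and `dense_histogram` are hypothetical, the workers are simulated by an ordinary loop rather than threads or cluster nodes, and serialization is left out entirely. The sketch assumes eager reduction means folding each emitted value into a per-worker partial result as soon as it is produced, and that a small fixed key range can be served by a dense array in place of a hash map.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using Counts = std::unordered_map<std::string, long>;

// Hypothetical helper, not part of Blaze. Counts words with eager reduction:
// every emitted (word, 1) pair is added to the worker's partial result at
// emit time, so no list of intermediate pairs is ever stored.
Counts eager_word_count(const std::vector<std::string>& lines,
                        std::size_t num_workers) {
  if (num_workers == 0) num_workers = 1;
  std::vector<Counts> local(num_workers);  // one partial result per worker

  // Workers are simulated sequentially here; in a real system each one
  // would run on its own thread or node.
  for (std::size_t w = 0; w < num_workers; ++w) {
    for (std::size_t i = w; i < lines.size(); i += num_workers) {
      std::istringstream in(lines[i]);
      std::string word;
      while (in >> word) local[w][word] += 1;  // map and reduce in one step
    }
  }

  // Final merge: only one entry per distinct key per worker is combined.
  Counts total;
  for (const auto& part : local)
    for (const auto& kv : part) total[kv.first] += kv.second;
  return total;
}

// Hypothetical helper, not part of Blaze. When keys are known to lie in
// [0, key_range), each partial result can be a dense vector indexed by key,
// which avoids hashing altogether.
std::vector<long> dense_histogram(const std::vector<std::size_t>& keys,
                                  std::size_t key_range,
                                  std::size_t num_workers) {
  if (num_workers == 0) num_workers = 1;
  std::vector<std::vector<long>> local(num_workers,
                                       std::vector<long>(key_range, 0));
  for (std::size_t w = 0; w < num_workers; ++w) {
    for (std::size_t i = w; i < keys.size(); i += num_workers) {
      assert(keys[i] < key_range);  // keys must stay inside the fixed range
      local[w][keys[i]] += 1;
    }
  }
  std::vector<long> total(key_range, 0);
  for (const auto& part : local)
    for (std::size_t k = 0; k < key_range; ++k) total[k] += part[k];
  return total;
}
```

Calling `eager_word_count({"a b a", "b c", "a"}, 2)` yields a count for each of the three distinct words, and `dense_histogram` covers cases such as counting the points assigned to each of k clusters in k-means, where the key range is small and known in advance.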

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Cite as: arXiv:1902.01437 [cs.DC]
(or arXiv:1902.01437v2 [cs.DC] for this version)
https://doi.org/10.48550/arXiv.1902.01437

Submission history

From: Junhao Li
[v1] Mon, 4 Feb 2019 19:28:15 UTC (1,505 KB)
[v2] Wed, 6 Feb 2019 02:59:19 UTC (1,409 KB)
