Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach

Sun, Peng; Wen, Yonggang; Duong, Ta Nguyen Binh; Yan, Shengen

doi:10.1109/SMARTCOMP.2017.7947053

arXiv:1704.06738 (cs)

[Submitted on 22 Apr 2017]

Authors:Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, Shengen Yan

Abstract: Many cluster management systems (CMSs) have been proposed to share a single cluster among multiple distributed computing systems. However, none of the existing approaches handles distributed machine learning (ML) workloads while meeting all three of the following criteria: high resource utilization, fair resource allocation, and low sharing overhead. To solve this problem, we propose a new CMS named Dorm, which incorporates a dynamically-partitioned cluster management mechanism and a utilization-fairness optimizer. Specifically, Dorm uses container-based virtualization to partition a cluster, runs one application per partition, and can dynamically resize each partition at application runtime for resource efficiency and fairness. Each application launches its tasks directly on its assigned partition without frequently petitioning for resources, so Dorm imposes flat sharing overhead. Extensive performance evaluations showed that, compared to existing approaches, Dorm could simultaneously increase resource utilization by a factor of up to 2.32, reduce fairness loss by a factor of up to 1.52, and speed up popular distributed ML applications by a factor of up to 2.72. Dorm's sharing overhead is less than 5% in most cases.
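
Illustrative sketch (not from the paper): the idea of resizing per-application partitions at runtime can be shown with a toy repartitioning step. The code below is not Dorm's utilization-fairness optimizer, whose formulation the abstract does not give. It substitutes plain integer max-min fair sharing over a single resource type, and the function name `max_min_partitions`, the slot counts, and the application names are all hypothetical.

```python
from typing import Dict


def max_min_partitions(capacity: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split `capacity` slots among applications by integer max-min fairness.

    Assumption: one resource type measured in whole slots, and each
    application states the maximum number of slots it can use.
    This is a stand-in technique, not the optimizer described in the paper.
    """
    alloc = {app: 0 for app in demands}
    remaining = capacity
    # Applications that can still use more slots, in a fixed order.
    active = sorted(app for app, d in demands.items() if d > 0)
    while remaining > 0 and active:
        share = remaining // len(active)
        if share == 0:
            # Fewer slots than applications: one slot each, in name order.
            for app in active[:remaining]:
                alloc[app] += 1
            remaining = 0
            break
        still_active = []
        for app in active:
            give = min(share, demands[app] - alloc[app])
            alloc[app] += give
            remaining -= give
            if alloc[app] < demands[app]:
                still_active.append(app)
        active = still_active
    return alloc


# Three applications share a 10-slot cluster.
before = max_min_partitions(10, {"a": 2, "b": 8, "c": 8})
print(before)  # {'a': 2, 'b': 4, 'c': 4}

# Application "a" finishes; the remaining partitions are resized.
after = max_min_partitions(10, {"b": 8, "c": 8})
print(after)  # {'b': 5, 'c': 5}
```

In this toy, rerunning the function whenever an application starts or finishes plays the role of the dynamic resize: no slot stays idle while some application can still use it, and no application receives more than it asked for.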

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1704.06738 [cs.DC]
	(or arXiv:1704.06738v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1704.06738
Related DOI:	https://doi.org/10.1109/SMARTCOMP.2017.7947053

Submission history

From: Peng Sun
[v1] Sat, 22 Apr 2017 03:17:18 UTC (5,148 KB)
