Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Rashidi, Saeed; Won, William; Srinivasan, Sudarshan; Sridharan, Srinivas; Krishna, Tushar

arXiv:2110.04478v1 (cs)

[Submitted on 9 Oct 2021 (this version), latest version 7 Jul 2022 (v3)]

Abstract: The continuous growth in both model size and training data for modern Deep Neural Network (DNN) models has led to training tasks that take days or even months. Distributed training reduces training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activations, depending on the parallelization strategy. In today's datacenters, for training at scale, NPUs are connected through multi-dimensional interconnection links with different bandwidth and latency. Hence, keeping all network dimensions busy and maximizing the network bandwidth (BW) is a challenging task in such a hybrid network environment, as this work identifies. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of a single All-Reduce by 1.88x (2.92x max), and improve the end-to-end training iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and 1.35x (1.78x max), respectively.
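To make the load-balancing idea in the abstract concrete, the following is a minimal toy sketch, not the algorithm from the paper. It assumes a hierarchical reduce-scatter in which a stage on a dimension with n NPUs sends (n-1)/n of the current data and leaves 1/n of it for the next stage, and it ignores link latency and pipelining dependencies between stages. The function names (`stage_loads`, `fixed_order_busy`, `balanced_schedule`) and the greedy "pick the dimension order that minimizes the busiest dimension" rule are illustrative assumptions.

```python
from itertools import permutations


def stage_loads(chunk_bytes, order, dim_sizes):
    """Bytes each dimension carries when one chunk is reduce-scattered in `order`.

    Assumed model: a stage on a dimension with n NPUs sends (n-1)/n of the
    current data, after which each NPU holds 1/n of it.
    """
    loads = [0.0] * len(dim_sizes)
    size = float(chunk_bytes)
    for d in order:
        n = dim_sizes[d]
        loads[d] += size * (n - 1) / n
        size /= n
    return loads


def fixed_order_busy(chunks, dim_sizes, bandwidths):
    """Baseline: every chunk traverses the dimensions in the same order."""
    order = tuple(range(len(dim_sizes)))
    busy = [0.0] * len(dim_sizes)  # accumulated transfer time per dimension
    for chunk in chunks:
        loads = stage_loads(chunk, order, dim_sizes)
        busy = [b + l / bw for b, l, bw in zip(busy, loads, bandwidths)]
    return busy


def balanced_schedule(chunks, dim_sizes, bandwidths):
    """Greedy sketch: per chunk, pick the dimension order that keeps the
    busiest dimension's accumulated transfer time as small as possible."""
    busy = [0.0] * len(dim_sizes)
    plan = []
    for chunk in chunks:
        best_order, best_busy = None, None
        for order in permutations(range(len(dim_sizes))):
            loads = stage_loads(chunk, order, dim_sizes)
            trial = [b + l / bw for b, l, bw in zip(busy, loads, bandwidths)]
            if best_busy is None or max(trial) < max(best_busy):
                best_order, best_busy = order, trial
        plan.append(best_order)
        busy = best_busy
    return plan, busy


if __name__ == "__main__":
    chunks = [64] * 8            # hypothetical chunk sizes (arbitrary units)
    dim_sizes = (4, 4)           # hypothetical 2D network, 4 NPUs per dimension
    bandwidths = (1.0, 1.0)      # hypothetical per-dimension bandwidths
    print("fixed order busy time:", fixed_order_busy(chunks, dim_sizes, bandwidths))
    plan, busy = balanced_schedule(chunks, dim_sizes, bandwidths)
    print("balanced busy time:  ", busy)
    print("per-chunk orders:    ", plan)
```

In this toy model the fixed order concentrates most of the traffic on the first dimension, because every chunk is at its largest when it enters that dimension. Varying the order per chunk spreads the same total traffic more evenly, which is the kind of imbalance the abstract describes.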

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Cite as: arXiv:2110.04478 [cs.DC] (or arXiv:2110.04478v1 [cs.DC] for this version)
DOI: https://doi.org/10.48550/arXiv.2110.04478

Submission history

From: Saeed Rashidi
[v1] Sat, 9 Oct 2021 06:50:04 UTC (2,970 KB)
[v2] Wed, 4 May 2022 00:46:03 UTC (3,787 KB)
[v3] Thu, 7 Jul 2022 04:20:56 UTC (3,843 KB)
