An Empirical Evaluation of Allgatherv on Multi-GPU Systems

Rolinger, Thomas B.; Simon, Tyler A.; Krieger, Christopher D.

doi:10.1109/CCGRID.2018.00027

arXiv:1812.05964 (cs)

[Submitted on 14 Dec 2018]

Abstract: Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU micro-benchmark and conduct a case study on sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying numbers of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH, and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs, and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produces trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.
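
For readers unfamiliar with the routine, the following is a minimal pure-Python sketch of what an Allgatherv-style collective computes when each rank contributes a different number of elements. It is not the authors' code and does not call MPI, MVAPICH, or NCCL; the function names `allgatherv_layout` and `simulate_allgatherv` and the sample buffers are hypothetical, chosen only to illustrate irregular message sizes.

```python
from itertools import accumulate


def allgatherv_layout(send_counts):
    """Return (recv_counts, displs) for an Allgatherv-style exchange.

    Hypothetical helper: recv_counts[i] is how many elements rank i sends,
    displs[i] is the offset of rank i's data in every receive buffer.
    """
    recv_counts = list(send_counts)
    # Exclusive prefix sum: rank i's block starts where ranks 0..i-1 end.
    displs = [0] + list(accumulate(recv_counts))[:-1]
    return recv_counts, displs


def simulate_allgatherv(send_buffers):
    """Simulate the buffer every rank would hold after the collective."""
    counts, displs = allgatherv_layout([len(b) for b in send_buffers])
    recv = [None] * sum(counts)
    for rank, buf in enumerate(send_buffers):
        # Place each rank's variable-length block at its displacement.
        recv[displs[rank]:displs[rank] + counts[rank]] = buf
    return recv, counts, displs


if __name__ == "__main__":
    # Four simulated ranks with deliberately irregular message sizes,
    # including one rank that contributes nothing.
    buffers = [[1], [2, 3, 4, 5], [], [6, 7]]
    recv, counts, displs = simulate_allgatherv(buffers)
    print("counts:", counts)
    print("displs:", displs)
    print("gathered:", recv)
```

In a real MPI program the counts and displacements would be passed to the library's Allgatherv call rather than applied by hand; the sketch only shows why uneven counts make the per-rank layout, and hence the communication pattern, irregular.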

Comments:	2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Cite as:	arXiv:1812.05964 [cs.DC]
	(or arXiv:1812.05964v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1812.05964
Related DOI:	https://doi.org/10.1109/CCGRID.2018.00027

Submission history

From: Thomas Rolinger
[v1] Fri, 14 Dec 2018 14:46:25 UTC (635 KB)
