An Empirical Evaluation of Allgatherv on Multi-GPU Systems
Authors:
Thomas B. Rolinger,
Tyler A. Simon,
Christopher D. Krieger
Abstract:
Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the A…
▽ More
Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU-micro benchmark as well as conduct a case study for sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying number of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produce trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.
△ Less
Submitted 14 December, 2018;
originally announced December 2018.
Parallel Sparse Tensor Decomposition in Chapel
Authors:
Thomas B. Rolinger,
Tyler A. Simon,
Christopher D. Krieger
Abstract:
In big-data analytics, using tensor decomposition to extract patterns from large, sparse multivariate data is a popular technique. Many challenges exist for designing parallel, high performance tensor decomposition algorithms due to irregular data accesses and the growing size of tensors that are processed. There have been many efforts at implementing shared-memory algorithms for tensor decomposit…
▽ More
In big-data analytics, using tensor decomposition to extract patterns from large, sparse multivariate data is a popular technique. Many challenges exist for designing parallel, high performance tensor decomposition algorithms due to irregular data accesses and the growing size of tensors that are processed. There have been many efforts at implementing shared-memory algorithms for tensor decomposition, most of which have focused on the traditional C/C++ with OpenMP framework. However, Chapel is becoming an increasingly popular programing language due to its expressiveness and simplicity for writing scalable parallel programs. In this work, we port a state of the art C/OpenMP parallel sparse tensor decomposition tool, SPLATT, to Chapel. We present a performance study that investigates bottlenecks in our Chapel code and discusses approaches for improving its performance. Also, we discuss features in Chapel that would have been beneficial to our porting effort. We demonstrate that our Chapel code is competitive with the C/OpenMP code for both runtime and scalability, achieving 83%-96% performance of the original code and near linear scalability up to 32 cores.
△ Less
Submitted 14 December, 2018;
originally announced December 2018.