Distributed Out-of-Memory NMF on CPU/GPU Architectures

Boureima, Ismael; Bhattarai, Manish; Eren, Maksim; Skau, Erik; Romero, Philip; Eidenbenz, Stephan; Alexandrov, Boian

doi:10.1007/s11227-023-05587-4

arXiv:2202.09518 (cs)

[Submitted on 19 Feb 2022 (v1), last revised 12 Sep 2023 (this version, v4)]


Abstract: We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems, where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden by using CUDA streams to overlap data transfers and computation asynchronously. Latency associated with collective communications (both intra-node and inter-node) is reduced by using optimized communicators based on the NVIDIA Collective Communication Library (NCCL). Benchmark results show speedups of 32x to 76x for the new GPU implementation over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10e-6.
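The batching/tiling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process NumPy example that assumes the standard Frobenius-norm multiplicative updates and a row-wise split of the input matrix. The function name `batched_nmf` and its parameters are hypothetical. Each row batch stands in for a tile that would be copied from host to device in an out-of-memory setting.

```python
import numpy as np

def batched_nmf(A, k, n_batches=4, n_iter=200, eps=1e-9, seed=0):
    """Approximate A by W @ H (all non-negative), touching A one row batch at a time.

    Assumption: Frobenius-norm multiplicative updates; not the NMFk code itself.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    # Row slices standing in for tiles that would be streamed host -> device.
    bounds = np.linspace(0, m, n_batches + 1, dtype=int)
    batches = [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    for _ in range(n_iter):
        # H update: only the small (k x n) and (k x k) products are accumulated,
        # so the full A never has to be resident at once.
        WtA = np.zeros((k, n))
        WtW = np.zeros((k, k))
        for rows in batches:
            Wb = W[rows]
            WtA += Wb.T @ A[rows]
            WtW += Wb.T @ Wb
        H *= WtA / (WtW @ H + eps)
        # W update: each batch needs only its own rows of A plus the small H factors.
        HHt = H @ H.T
        for rows in batches:
            W[rows] *= (A[rows] @ H.T) / (W[rows] @ HHt + eps)
    return W, H

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((40, 3)) @ rng.random((3, 30))  # synthetic non-negative rank-3 matrix
    W, H = batched_nmf(A, k=3)
    print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))  # relative reconstruction error
```

In this sketch the peak working set per step is one batch of `A` plus the small factors, which is the property that batching exploits. On a GPU, the per-batch loop is where host-to-device copies could be overlapped with computation using streams; that part is not shown here.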

Comments:	Accepted at Journal of Supercomputing
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2202.09518 [cs.DC]
	(or arXiv:2202.09518v4 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2202.09518
Related DOI:	https://doi.org/10.1007/s11227-023-05587-4

Submission history

From: Manish Bhattarai
[v1] Sat, 19 Feb 2022 03:49:21 UTC (3,133 KB)
[v2] Wed, 16 Mar 2022 16:22:37 UTC (3,130 KB)
[v3] Thu, 10 Aug 2023 04:11:51 UTC (4,327 KB)
[v4] Tue, 12 Sep 2023 23:16:07 UTC (4,327 KB)
