FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

Randall, Thomas; Allen, Tyler; Ge, Rong

doi:10.1145/3447818.3460373


arXiv:2312.07743 (cs)

[Submitted on 12 Dec 2023]


Abstract: Word2Vec remains one of the most impactful innovations in Natural Language Processing (NLP); it represents latent grammatical and syntactical information in human text as dense, low-dimensional vectors. Word2Vec is computationally expensive because of the algorithm's inherent sequentiality, its intensive memory accesses, and the large vocabularies it represents. Prior studies have investigated ways to expose parallelism and improve memory system performance, but they struggle to gain throughput effectively on powerful GPUs.
We identify memory data accesses and latency as the primary bottleneck in prior GPU implementations; this bottleneck prevents even highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to the lower levels of the memory hierarchy and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly (e.g., by more than 89%) compared to prior state-of-the-art GPU implementations, resulting in performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reduction, significantly improves arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
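
To make the data-reuse argument concrete, here is a minimal sketch that counts how many context-word vectors would be fetched from slow memory while a fixed-width window slides over one sentence. It is an illustration under stated assumptions, not the FULL-W2V kernel: the function name `count_context_loads`, the fixed half-window width, and the simplification that the centre word counts as part of the window are all hypothetical choices made for this example. The numbers it produces are not comparable to the measurements reported in the abstract.

```python
from collections import deque
from typing import Tuple


def count_context_loads(sentence_len: int, half_window: int) -> Tuple[int, int]:
    """Return (naive_loads, cached_loads) of word vectors for one sentence.

    Assumptions (illustrative only): every window position, centre included,
    needs one vector; the "cache" stands in for fast on-chip storage such as
    shared memory and is large enough to hold one full window.
    """
    naive = 0
    cached = 0
    resident = deque()  # sentence positions currently held in the fast cache
    for target in range(sentence_len):
        lo = max(0, target - half_window)
        hi = min(sentence_len - 1, target + half_window)

        # Naive scheme: every position in the window is fetched again.
        naive += hi - lo + 1

        # Cached scheme: evict positions that slid out of the window ...
        while resident and resident[0] < lo:
            resident.popleft()
        # ... and fetch only the positions that are new to the window.
        next_pos = resident[-1] + 1 if resident else lo
        for pos in range(next_pos, hi + 1):
            resident.append(pos)
            cached += 1
    return naive, cached


if __name__ == "__main__":
    naive, cached = count_context_loads(sentence_len=1000, half_window=5)
    print(f"naive loads: {naive}, cached loads: {cached}")
```

Because neighbouring windows overlap in all but one position, the cached scheme fetches each position once, while the naive scheme fetches it roughly once per window that contains it. This shows only the general idea of keeping a window's vectors in fast memory across its lifetime.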

Comments:	12 pages, 7 figures, 7 tables. The definitive version of this work is published in the Proceedings of the ACM International Conference on Supercomputing 2021.
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes:	I.2.7; D.1.3; G.4
Cite as:	arXiv:2312.07743 [cs.LG]
	(or arXiv:2312.07743v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.07743
Journal reference:	Proceedings of the ACM International Conference on Supercomputing (2021) 455-466
Related DOI:	https://doi.org/10.1145/3447818.3460373

Submission history

From: Thomas Randall
[v1] Tue, 12 Dec 2023 21:22:07 UTC (1,033 KB)
