OCCL: a Deadlock-free Library for GPU Collective Communication

Pan, Lichen; Liu, Juncheng; Yuan, Jinhui; Zhang, Rongkai; Li, Pengze; Xiao, Zhen

Abstract: Various distributed deep neural network (DNN) training technologies lead to increasingly complicated use of collective communications on GPUs. Because GPU collectives are deadlock-prone, researchers must guarantee that collectives are enqueued in a consistent order on every GPU. In complex distributed DNN training scenarios, manual hardcoding is the only practical way to prevent deadlocks, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPUs that supports dynamic decentralized preemption and gang-scheduling for collectives. Leveraging the preemption opportunity of collectives on GPUs, OCCL dynamically preempts collectives in a decentralized way via its deadlock-free collective execution framework and allows dynamic decentralized gang-scheduling via its stickiness adjustment scheme. With OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate it with the distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives compared to NCCL, the state-of-the-art library. When used in distributed DNN training, OCCL can improve the peak training throughput by up to 78% compared to statically sequenced NCCL, while introducing overheads of less than 6.5% across various distributed DNN training approaches.
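
The ordering problem described in the abstract can be illustrated with a toy model. The sketch below is not OCCL's algorithm or API. It is a minimal Python simulation under stated assumptions: every collective involves all GPUs, each GPU holds a list of collective names, and a collective completes only when all GPUs take part in it at the same time. The function names `run_blocking` and `run_preemptive` and the collective names are hypothetical. A single central loop stands in for decisions that a real library would have to make in a decentralized way.

```python
from collections import deque


def run_blocking(queues):
    """Toy model of blocking collectives.

    Each GPU waits on the head of its own queue. A collective completes
    only when every GPU is waiting on that same collective.
    Returns (completed_collectives, deadlocked).
    """
    queues = [deque(q) for q in queues]
    completed = []
    while any(queues):
        heads = {q[0] for q in queues if q}
        # No progress is possible unless all GPUs wait on one collective.
        if len(heads) != 1 or not all(queues):
            return completed, True
        for q in queues:
            q.popleft()
        completed.append(heads.pop())
    return completed, False


def run_preemptive(queues):
    """Toy model in which a waiting collective can be set aside.

    Assumption for illustration: a GPU may defer a collective whose peers
    are not ready and take part in another pending one instead.
    Returns (completed_collectives, deadlocked).
    """
    pending = [list(q) for q in queues]
    completed = []
    while any(pending):
        # Collectives that every GPU still has pending.
        ready = set(pending[0]).intersection(*pending[1:])
        if not ready:
            return completed, True
        # Deterministic choice: the earliest ready one in GPU 0's order.
        chosen = next(c for c in pending[0] if c in ready)
        for p in pending:
            p.remove(chosen)
        completed.append(chosen)
    return completed, False


if __name__ == "__main__":
    consistent = [["allreduce_a", "allreduce_b"], ["allreduce_a", "allreduce_b"]]
    inconsistent = [["allreduce_a", "allreduce_b"], ["allreduce_b", "allreduce_a"]]
    print(run_blocking(consistent))
    print(run_blocking(inconsistent))
    print(run_preemptive(inconsistent))
```

In this model, the blocking version finishes only when both GPUs enqueue the collectives in the same order, while the preemptive version also finishes the inconsistently ordered case. That is the property the abstract attributes to preemption, shown here only in simplified form.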

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Cite as:	arXiv:2303.06324 [cs.DC]
	(or arXiv:2303.06324v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2303.06324

