CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures

Hu, Weifang; Shi, Xuanhua; Wu, Chang; Zhang, Yunkai; Peng, Xuan; Zhai, Jiaqi; Jin, Hai; Zhou, Yongluan; Qian, Xuehai

arXiv:2504.00598 (cs)

[Submitted on 1 Apr 2025]

Abstract: This paper introduces CFP, a system that searches intra-operator parallelism configurations by leveraging runtime profiles of actual parallel programs. The key idea is to profile a limited space by identifying a new structure named ParallelBlock, which is a group of operators with the property of communication-free tensor partition propagation: the partition of its input tensor can propagate through all operators to its output tensor without introducing communication or synchronization. Based on this property, the optimal tensor partition of the operators within a ParallelBlock should be inferred from the partition of the input tensor through partition propagation, which prevents avoidable communication. Thus, the search space can be reduced by profiling each ParallelBlock only with different input tensor partitions at its entry, instead of enumerating all combinations among the operators within the ParallelBlock. Moreover, the search space is further reduced by identifying ParallelBlock sequences (segments) with similar parallel behavior. CFP computes the overall performance of the model based on the profiles of all segments. On GPT, LLAMA, and MoE models, CFP achieves up to 1.51x, 1.31x, and 3.43x speedups, respectively, over the state-of-the-art framework, Alpa.
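
The propagation idea in the abstract can be illustrated with a small toy model. The sketch below is not CFP's implementation: the `Op` class, the three operator kinds, and the assumption that each operator has exactly three partition choices are all hypothetical, chosen only to show how fixing the input split determines every downstream split and why that shrinks the number of configurations to profile.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical toy operators; none of these names come from CFP.
@dataclass(frozen=True)
class Op:
    name: str
    kind: str                     # "elementwise", "transpose", or "reduce"
    perm: tuple = ()              # axis permutation, used by "transpose"
    axis: Optional[int] = None    # reduced axis, used by "reduce"


def propagate(op: Op, split_axis: int) -> Optional[int]:
    """Return the output split axis, or None if communication would be needed."""
    if op.kind == "elementwise":
        return split_axis                  # same layout in and out
    if op.kind == "transpose":
        return op.perm.index(split_axis)   # output axis j reads input axis perm[j]
    if op.kind == "reduce":
        if op.axis == split_axis:
            return None                    # partial sums would need an all-reduce
        # The reduced axis is dropped, so later axes shift down by one.
        return split_axis - 1 if split_axis > op.axis else split_axis
    raise ValueError(f"unknown op kind: {op.kind}")


def propagate_block(ops, split_axis):
    """Propagate an input split through a chain; None means not communication-free."""
    for op in ops:
        split_axis = propagate(op, split_axis)
        if split_axis is None:
            return None
    return split_axis


# A toy chain over a rank-3 tensor (assumed example, not taken from the paper).
block = [
    Op("gelu", "elementwise"),
    Op("swap", "transpose", perm=(1, 0, 2)),
    Op("sum_last", "reduce", axis=2),
]

# One candidate per input split axis, instead of one per combination of
# per-operator choices.
candidates = {axis: propagate_block(block, axis) for axis in range(3)}
exhaustive = 3 ** len(block)   # assumed: 3 independent choices per operator
per_block = len(candidates)    # input partitions to consider at the block entry

print(candidates, exhaustive, per_block)
```

In this toy chain, splitting the input along axis 0 or axis 1 propagates to the output without communication, while splitting along axis 2 collides with the reduction and is rejected. Under the stated assumptions, only 3 entry partitions need to be considered rather than 27 per-operator combinations.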

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2504.00598 [cs.DC]
	(or arXiv:2504.00598v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2504.00598

Submission history

From: Weifang Hu
[v1] Tue, 1 Apr 2025 09:56:58 UTC (2,643 KB)
