FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Wang, Yujie; Wang, Shiju; Zhu, Shenhan; Fu, Fangcheng; Liu, Xinyi; Xiao, Xuefeng; Li, Huixia; Li, Jiashi; Wu, Faming; Cui, Bin

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2412.01523 (cs)

[Submitted on 2 Dec 2024 (v1), last revised 11 Feb 2025 (this version, v3)]

Abstract: Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity.
In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.
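
To make the contrast between a static and an adaptive scattering strategy concrete, here is a minimal toy sketch in Python. It is not the paper's linear programming formulation or solver. It assumes a hypothetical per-device token budget (`tokens_per_device`) and power-of-two sequence parallelism degrees, and all function names are illustrative. A static strategy must size every sequence's device group for the longest sequence in the batch, while a per-sequence choice lets short sequences use fewer devices.

```python
from collections import Counter
from math import ceil


def min_sp_degree(seq_len: int, tokens_per_device: int) -> int:
    """Smallest power-of-two device count whose combined budget fits the sequence.

    Assumption: a device can hold at most `tokens_per_device` tokens of one
    sequence. This budget is a hypothetical stand-in for a real cost model.
    """
    needed = ceil(seq_len / tokens_per_device)
    degree = 1
    while degree < needed:
        degree *= 2
    return degree


def static_degree(seq_lens, tokens_per_device):
    """One degree for the whole batch: it must fit the longest sequence."""
    return max(min_sp_degree(n, tokens_per_device) for n in seq_lens)


def adaptive_degrees(seq_lens, tokens_per_device):
    """Per-sequence degrees, returned as a histogram {degree: sequence count}."""
    return Counter(min_sp_degree(n, tokens_per_device) for n in seq_lens)


if __name__ == "__main__":
    # A long-tail batch: mostly short sequences plus one very long one.
    lengths = [512, 1024, 2048, 2048, 4096, 65536]
    budget = 8192  # hypothetical tokens per device

    print("static degree for all sequences:", static_degree(lengths, budget))
    print("adaptive degree histogram:", dict(adaptive_degrees(lengths, budget)))
```

In this toy batch, only the single long sequence needs a large device group. A real system would also have to weigh communication cost and balance work across groups, which is the kind of trade-off the abstract says is handled by the optimization formulation.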

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2412.01523 [cs.DC]
	(or arXiv:2412.01523v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2412.01523

Submission history

From: Yujie Wang
[v1] Mon, 2 Dec 2024 14:16:03 UTC (4,314 KB)
[v2] Mon, 10 Feb 2025 12:00:50 UTC (8,070 KB)
[v3] Tue, 11 Feb 2025 07:31:03 UTC (8,045 KB)
