Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Sun, Ao; Zhao, Weilin; Han, Xu; Yang, Cheng; Zhang, Xinrong; Liu, Zhiyuan; Shi, Chuan; Sun, Maosong

arXiv:2406.03488 (cs)

[Submitted on 5 Jun 2024 (v1), last revised 11 Nov 2024 (this version, v5)]

Abstract: The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLM training sequence lengths extend to 32k or even 128k, current pipeline-parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, which greatly hinder model scalability and training throughput. To enhance memory efficiency and training throughput, we introduce Seq1F1B, an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing both bubble size and memory footprint. Because Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baselines such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences up to 64k using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is now available at: this https URL.
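
The abstract mentions a computation-wise strategy for partitioning input sequences but does not give its details. The sketch below only illustrates the general idea and is not the authors' implementation. It assumes a hypothetical per-token cost model for causal attention, where the token at position p costs `linear_cost + attn_cost_per_key * p`, and it picks segment boundaries so that each segment carries roughly the same total cost. The function names, parameters, and default constants are all assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def token_cost(position, linear_cost, attn_cost_per_key):
    """Assumed cost of one token: a fixed part plus a part that grows
    with the number of earlier tokens it attends to (causal attention)."""
    return linear_cost + attn_cost_per_key * position


def balanced_split(seq_len, num_segments, linear_cost=1.0, attn_cost_per_key=0.001):
    """Return segment lengths whose assumed compute cost is roughly equal.

    Hypothetical helper: the cost model and defaults are illustrative only.
    """
    if num_segments < 1 or seq_len < num_segments:
        raise ValueError("need 1 <= num_segments <= seq_len")

    # prefix[i] is the cumulative cost of the first i + 1 tokens.
    prefix = list(
        accumulate(
            token_cost(p, linear_cost, attn_cost_per_key)
            for p in range(1, seq_len + 1)
        )
    )
    total = prefix[-1]

    boundaries = []
    prev = 0
    for k in range(1, num_segments):
        target = total * k / num_segments
        # Smallest token count whose cumulative cost reaches the target.
        b = bisect_left(prefix, target) + 1
        # Keep at least one token in this segment and in every later one.
        b = max(b, prev + 1)
        b = min(b, seq_len - (num_segments - k))
        boundaries.append(b)
        prev = b
    boundaries.append(seq_len)

    starts = [0] + boundaries[:-1]
    return [end - start for start, end in zip(starts, boundaries)]


if __name__ == "__main__":
    print(balanced_split(8192, 4))
```

Under this assumed cost model, later segments come out shorter than earlier ones, because tokens near the end of the sequence attend to more keys and therefore cost more. Setting `attn_cost_per_key` to zero reduces the split to an even one.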

Comments:	12 pages, 4 figures, 6 tables
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2406.03488 [cs.DC]
	(or arXiv:2406.03488v5 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2406.03488

Submission history

From: Ao Sun
[v1] Wed, 5 Jun 2024 17:50:03 UTC (466 KB)
[v2] Thu, 6 Jun 2024 05:48:53 UTC (468 KB)
[v3] Mon, 9 Sep 2024 07:31:36 UTC (468 KB)
[v4] Fri, 8 Nov 2024 09:02:24 UTC (1,075 KB)
[v5] Mon, 11 Nov 2024 01:33:50 UTC (1,013 KB)
