SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Kim, Han-Byul; Hoang, Duc; Kundu, Arnav; Samragh, Mohammad; Cho, Minsik

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2502.20727 (cs)

[Submitted on 28 Feb 2025 (v1), last revised 1 Jun 2025 (this version, v4)]

Abstract: With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
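To make the idea of a "sync point" concrete, the following is a minimal NumPy sketch of one tensor-parallel block, simulated on a single machine. It is a toy illustration under stated assumptions, not the paper's block design: the attention and MLP sublayers are replaced by plain matrix products, and the names `Comm`, `tp_block`, `wa`, `wo`, `w1`, and `w2` are hypothetical. With synchronization, the attention output is all-reduced before the MLP. With the sync dropped, each simulated device feeds only its local partial attention output into its MLP shard, and the partials are summed once at the remaining MLP sync point.

```python
import numpy as np

class Comm:
    """Simulated collective that counts how many sync points were executed."""
    def __init__(self):
        self.sync_points = 0

    def all_reduce(self, partials):
        # Sum the per-device partial results, as an all-reduce would.
        self.sync_points += 1
        return np.sum(partials, axis=0)

def tp_block(x, wa, wo, w1, w2, n_dev, drop_attn_sync, comm):
    """Toy block: x + 'attention' (wa, wo), then + 'MLP' (w1, ReLU, w2)."""
    # Column-split the first matrix of each pair, row-split the second.
    wa_s, wo_s = np.split(wa, n_dev, axis=1), np.split(wo, n_dev, axis=0)
    w1_s, w2_s = np.split(w1, n_dev, axis=1), np.split(w2, n_dev, axis=0)

    # Each device holds only a partial sum of the attention output.
    attn_partial = [(x @ wa_s[i]) @ wo_s[i] for i in range(n_dev)]

    if drop_attn_sync:
        # Assumption: each device proceeds with its local partial only.
        mlp_in = [x + p for p in attn_partial]
    else:
        attn_out = comm.all_reduce(attn_partial)  # attention sync point
        mlp_in = [x + attn_out for _ in range(n_dev)]

    mlp_partial = [np.maximum(mlp_in[i] @ w1_s[i], 0.0) @ w2_s[i]
                   for i in range(n_dev)]

    if drop_attn_sync:
        # One sync point reduces the attention and MLP partials together.
        merged = [a + m for a, m in zip(attn_partial, mlp_partial)]
        return x + comm.all_reduce(merged)
    return mlp_in[0] + comm.all_reduce(mlp_partial)  # MLP sync point

rng = np.random.default_rng(0)
d, hidden, n_dev = 8, 16, 4
x = rng.normal(size=(2, d))
wa, wo = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
w1, w2 = 0.1 * rng.normal(size=(d, hidden)), 0.1 * rng.normal(size=(hidden, d))

# Unsharded reference computation.
y = x + x @ wa @ wo
ref = y + np.maximum(y @ w1, 0.0) @ w2

comm_sync, comm_drop = Comm(), Comm()
out_sync = tp_block(x, wa, wo, w1, w2, n_dev, False, comm_sync)
out_drop = tp_block(x, wa, wo, w1, w2, n_dev, True, comm_drop)

print("sync points:", comm_sync.sync_points, "vs", comm_drop.sync_points)
print("max deviation with sync dropped:", np.abs(out_drop - ref).max())
```

In this toy setup the synchronized path reproduces the unsharded result, while the dropped-sync path halves the number of collectives at the cost of an approximation in the MLP input. This is the kind of communication versus accuracy trade-off the abstract describes.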

Comments:	International Conference on Machine Learning (ICML) 2025
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2502.20727 [cs.DC]
	(or arXiv:2502.20727v4 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.20727

Submission history

From: Han-Byul Kim
[v1] Fri, 28 Feb 2025 05:20:48 UTC (2,784 KB)
[v2] Sun, 4 May 2025 07:48:11 UTC (2,785 KB)
[v3] Wed, 21 May 2025 04:23:44 UTC (2,785 KB)
[v4] Sun, 1 Jun 2025 00:33:25 UTC (2,785 KB)
