Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Fernandez, Jared; Wehrstedt, Luca; Shamis, Leonid; Elhoushi, Mostafa; Saladi, Kalyan; Bisk, Yonatan; Strubell, Emma; Kahn, Jacob

arXiv:2411.13055 (cs)

[Submitted on 20 Nov 2024 (v1), last revised 12 Apr 2025 (this version, v2)]

Authors: Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

Abstract: Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g., GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e., compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model sizes, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, the overhead incurred by certain distributed communication strategies causes parallelization strategies previously thought to be sub-optimal to become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.

Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2411.13055 [cs.LG]
	(or arXiv:2411.13055v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2411.13055

Submission history

From: Jared Fernandez
[v1] Wed, 20 Nov 2024 06:05:11 UTC (1,064 KB)
[v2] Sat, 12 Apr 2025 19:46:24 UTC (2,064 KB)
