PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Authors: Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li
Abstract:
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
Submitted 12 September, 2023; v1 submitted 21 April, 2023; originally announced April 2023.
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Authors: Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala
Abstract:
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
Submitted 28 June, 2020; originally announced June 2020.
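The data-parallel principle described in the abstract above (replicate the model, compute gradients independently on each replica, then communicate gradients every iteration so the replicas stay consistent) can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: it uses plain Python rather than PyTorch, the "replicas" are entries in a list, and the helper names `local_gradient`, `all_reduce_mean`, and `train_step` are hypothetical and do not come from the paper or from the PyTorch API. It does not model bucketing, overlap of computation with communication, or skipped synchronization.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) pairs held by one replica


def local_gradient(w: float, shard: Shard) -> float:
    """Gradient of the mean squared error of y ~ w * x on one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> List[float]:
    """Stand-in for a collective all-reduce: every replica receives the mean.

    Assumption: a real system would run this over a network backend;
    here it is simulated inside a single process.
    """
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(weights: List[float], shards: List[Shard], lr: float) -> List[float]:
    """One iteration: independent local gradients, then gradient averaging."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)  # the per-iteration communication step
    return [w - lr * g for w, g in zip(weights, synced)]


# Two replicas start from the same weight and see different shards of y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]
for _ in range(50):
    weights = train_step(weights, shards, lr=0.01)

print(weights)
```

Because every replica applies the same averaged gradient to the same starting weight, the replicas hold identical weights after each step, which is the consistency property the abstract refers to.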