Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
Authors:
Foteini Strati,
Zhendong Zhang,
George Manos,
Ixeia Sánchez Périz,
Qinghao Hu,
Tiancheng Chen,
Berk Buzcu,
Song Han,
Pamela Delgado,
Ana Klimovic
Abstract:
The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
Submitted 23 April, 2025;
originally announced April 2025.