Showing 1–1 of 1 results for author: Périz, I S

  1. arXiv:2504.17096

    cs.DC

    Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters

    Authors: Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, Ana Klimovic

    Abstract: The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job confi…

    Submitted 23 April, 2025; originally announced April 2025.