Showing 1–1 of 1 results for author: Vagin, A

  1. arXiv:2502.16631

    cs.DC

    CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

    Authors: Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

    Abstract: Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared, and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly c…

    Submitted 23 February, 2025; originally announced February 2025.