
Showing 1–2 of 2 results for author: Spišaková, V

  1. arXiv:2502.16631  [pdf, other]

    cs.DC

    CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

    Authors: Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

    Abstract: Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared, and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly c…

    Submitted 23 February, 2025; originally announced February 2025.

  2. arXiv:2207.02531  [pdf, other]

    cs.DC

    A Kubernetes 'Bridge' operator between cloud and external resources

    Authors: Boris Lublinsky, Elise Jennings, Viktória Spišaková

    Abstract: Many scientific workflows require dedicated compute resources, including HPC clusters with optimized software, quantum resources, and dedicated hardware cluster systems like Ray, for example. At the same time, many scientific workflows today are built on Kubernetes leveraging growing support for workflow and support tools. To address the growing demand to support workflows on both cloud and dedica…

    Submitted 6 July, 2022; originally announced July 2022.

    Comments: 13 pages, 2 figures