ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

Filippini, Federica; Ardagna, Danilo; Lattuada, Marco; Amaldi, Edoardo; Ciavotta, Michele; Riedl, Maciek; Materka, Katarzyna; Skrzypek, Paweł; Magugliani, Fabrizio; Cicala, Marco

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2105.05080 (cs)

[Submitted on 11 May 2021]

Abstract: Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource demanding and benefit greatly from AI accelerators (e.g., GPUs). However, the effective management of GPU-powered clusters comes with great challenges. Among these, efficient scheduling and resource allocation solutions are crucial to maximize performance and minimize data center operational costs. In this paper we propose ANDREAS, an advanced scheduling solution that tackles these problems jointly, aiming to optimize DL training runtime workloads and their energy consumption in accelerated clusters. Simulation-based experiments demonstrate that we can achieve a cost reduction between 30 and 62% on average with respect to first-principle methods, while the validation on a real cluster shows a worst-case deviation below 13% between actual and predicted costs, proving the effectiveness of the ANDREAS solution in practical scenarios.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2105.05080 [cs.DC]
	(or arXiv:2105.05080v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2105.05080

Submission history

From: Federica Filippini
[v1] Tue, 11 May 2021 14:36:19 UTC (1,681 KB)
