DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

Fan, Yuping; Lan, Zhiling

arXiv:2105.07526 (cs)

[Submitted on 16 May 2021]

Abstract: For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high-performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make this manual process challenging, time-consuming, and error-prone. We present DRAS-CQSim, a reinforcement-learning-based HPC scheduling framework, to automatically learn an optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
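
To make the environment/agent split mentioned in the abstract concrete, the following is a minimal illustrative sketch and not the DRAS-CQSim API. Every name in it (`Job`, `ToyClusterEnv`, `RULES`, `EpsilonGreedyAgent`, `run_episode`) is hypothetical. It assumes all jobs are queued at time 0, uses the negative job wait time as the reward, and swaps in an epsilon-greedy bandit over three queue-ordering rules, which is far simpler than the reinforcement learning algorithms the framework is said to provide.

```python
import random
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    nodes: int    # nodes requested
    runtime: int  # run time in simulated time steps
    submit: int   # time step at which the job entered the queue


class ToyClusterEnv:
    """Hypothetical toy simulator; an assumption, not the CQSim interface."""

    def __init__(self, jobs, total_nodes):
        if any(j.nodes > total_nodes for j in jobs):
            raise ValueError("a job requests more nodes than the cluster has")
        self.queue = list(jobs)  # waiting jobs
        self.running = []        # (finish_time, nodes) for started jobs
        self.free = total_nodes
        self.now = 0

    def step(self, index):
        """Start queue[index] as soon as it fits; return its wait time."""
        job = self.queue.pop(index)
        # Advance simulated time, releasing nodes, until the job fits.
        while job.nodes > self.free:
            self.running.sort()
            finish, nodes = self.running.pop(0)
            self.now = max(self.now, finish)
            self.free += nodes
        self.free -= job.nodes
        self.running.append((self.now + job.runtime, job.nodes))
        return self.now - job.submit


# Actions: each rule maps the current queue to the index of the job to start.
RULES = {
    "fcfs": lambda q: 0,
    "shortest": lambda q: min(range(len(q)), key=lambda i: q[i].runtime),
    "smallest": lambda q: min(range(len(q)), key=lambda i: q[i].nodes),
}


class EpsilonGreedyAgent:
    """Bandit agent that keeps a running mean reward per action."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon  # exploration rate, a tunable hyperparameter
        self.rng = random.Random(seed)
        self.value = {a: 0.0 for a in self.actions}
        self.count = {a: 0 for a in self.actions}

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)  # explore
        return max(self.actions, key=lambda a: self.value[a])  # exploit

    def learn(self, action, reward):
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


def run_episode(agent, jobs, total_nodes):
    """Schedule every job once; return the summed wait time."""
    env = ToyClusterEnv(jobs, total_nodes)
    total_wait = 0
    while env.queue:
        action = agent.act()
        wait = env.step(RULES[action](env.queue))
        agent.learn(action, -wait)  # shorter waits give higher reward
        total_wait += wait
    return total_wait
```

Calling `run_episode` repeatedly with the same agent on a list of `Job` objects lets the value estimates accumulate across episodes. Changing `epsilon` or the reward definition is the kind of customization the abstract attributes to the framework's tuning options.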

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2105.07526 [cs.DC]
	(or arXiv:2105.07526v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2105.07526
Journal reference:	Software Impacts 2021

Submission history

From: Yuping Fan
[v1] Sun, 16 May 2021 21:56:31 UTC (18,843 KB)
