ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

Xu, Yiheng; Sivaraman, Pranav; Devarajan, Hariharan; Mohror, Kathryn; Bhatele, Abhinav

arXiv:2312.06131v2 (cs)

[Submitted on 11 Dec 2023 (v1), last revised 12 Jan 2024 (this version, v2)]

Abstract: Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining whether a given application could be accelerated using burst buffers is not straightforward, even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume and the number of processes involved) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data, which we use as the training input for the model. Our model can predict whether a file of an application should be placed on burst buffers with an accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.
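The modeling workflow described in the abstract (benchmark on each storage system, label each configuration with the faster system, train a classifier, predict for unseen configurations) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the model type or the feature set, so the sketch uses a 1-nearest-neighbor classifier over two hypothetical features (transfer size and process count), and every bandwidth value is made up rather than taken from the paper. It does not use PrismIO or IOR.

```python
from math import dist, log2

# Hypothetical benchmark records, one per configuration:
# (transfer_size_KiB, num_processes, burst_buffer_MBps, parallel_fs_MBps).
# All numbers are invented for illustration; they are NOT results from the paper.
RUNS = [
    (64, 16, 900.0, 400.0),
    (64, 256, 1500.0, 700.0),
    (1024, 16, 1200.0, 1100.0),
    (16384, 64, 1800.0, 2600.0),
    (16384, 1024, 2000.0, 5200.0),
    (65536, 256, 2100.0, 6000.0),
]


def features(transfer_kib, nprocs):
    """Map a configuration to a feature vector (log scale, an assumed choice)."""
    return (log2(transfer_kib), log2(nprocs))


def label(bb_mbps, pfs_mbps):
    """Label a run with whichever storage system delivered higher bandwidth."""
    return "BB" if bb_mbps > pfs_mbps else "PFS"


# Training set: one (feature vector, label) pair per benchmark record.
TRAIN = [(features(t, p), label(bb, pfs)) for t, p, bb, pfs in RUNS]


def predict(transfer_kib, nprocs):
    """1-nearest-neighbor prediction: reuse the label of the closest training run."""
    query = features(transfer_kib, nprocs)
    _, nearest_label = min(TRAIN, key=lambda row: dist(row[0], query))
    return nearest_label


if __name__ == "__main__":
    # Predict placement for two configurations that are not in RUNS.
    print(predict(128, 32))
    print(predict(32768, 512))
```

A nearest-neighbor rule is used only because it fits in a few lines of standard-library Python; any classifier could take its place once the runs are labeled this way.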

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2312.06131 [cs.DC]
	(or arXiv:2312.06131v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2312.06131

Submission history

From: Abhinav Bhatele
[v1] Mon, 11 Dec 2023 05:33:00 UTC (1,007 KB)
[v2] Fri, 12 Jan 2024 03:32:49 UTC (1,007 KB)
