Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

Sarker, Arup Kumar; Alsaadi, Aymen; Halpern, Alexander James; Tangella, Prabhath; Titov, Mikhail; Perera, Niranda; Staylor, Mills; von Laszewski, Gregor; Jha, Shantenu; Fox, Geoffrey

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2502.20724 (cs)

[Submitted on 28 Feb 2025 (v1), last revised 22 Apr 2025 (this version, v2)]

Abstract: Scientific domains such as genetics, climate modeling, and astronomy face significant obstacles in managing, preprocessing, and training deep learning models on complex data. Although several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning frameworks, and data frameworks on high-performance computing (HPC) platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the use of communication libraries such as MPI, GLOO, and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by using Radical Pilot as its task execution framework. Running an end-to-end pipeline of preprocessing, model training, and postprocessing under identical resource conditions, the system saves 3.28 seconds for 11 neural forecasting models (PyTorch) and 75.9 seconds for hydrology models (TensorFlow). Deep RC is designed to integrate scalable data frameworks such as Cylon smoothly with deep learning processes, and it shows strong performance on both cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this approach closes the gap between data preprocessing, model training, and postprocessing.

Comments:	13 pages, 9 figures, 4 tables
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes:	H.2.4; D.2.7; D.2.2
Cite as:	arXiv:2502.20724 [cs.DC]
	(or arXiv:2502.20724v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.20724

Submission history

From: Arup Kumar Sarker
[v1] Fri, 28 Feb 2025 05:16:42 UTC (8,820 KB)
[v2] Tue, 22 Apr 2025 16:35:29 UTC (8,820 KB)