In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

Perera, Niranda; Sarker, Arup Kumar; Staylor, Mills; von Laszewski, Gregor; Shan, Kaiying; Kamburugamuve, Supun; Widanage, Chathura; Abeykoon, Vibhatha; Kanewela, Thejaka Amila; Fox, Geoffrey

arXiv:2307.01394 (cs)

[Submitted on 3 Jul 2023]

Abstract: The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Report number:	FGCS-D-23-00577R1
Cite as:	arXiv:2307.01394 [cs.DC]
	(or arXiv:2307.01394v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2307.01394

Submission history

From: Niranda Perera
[v1] Mon, 3 Jul 2023 23:11:03 UTC (4,224 KB)
