Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:1904.11812 (cs)
[Submitted on 26 Apr 2019 (v1), last revised 27 Sep 2019 (this version, v2)]

Title: A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

Authors: George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath
Abstract: As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and data sharing are becoming performance bottlenecks. Cloud computing employs a data processing paradigm typically built on a loosely connected group of low-cost computing nodes without relying upon shared storage and/or memory. Apache Spark is a popular engine for large-scale data analysis in the cloud, which we have successfully deployed via job submission scripts on production clusters.
In this paper, we describe common parallel analysis dataflows for both Message Passing Interface (MPI) and cloud-based applications. We developed an effective benchmark to measure the performance characteristics of these tasks using both types of systems, specifically comparing MPI/C-based analyses with Spark. The benchmark is a data processing pipeline representative of a typical analytics framework implemented using map-reduce. In the case of Spark, we also consider whether language plays a role by writing tests using both Python and Scala, a language built on the Java Virtual Machine (JVM). We include performance results from two large systems at Argonne National Laboratory, including Theta, a Cray XC40 supercomputer on which our experiments run with 65,536 cores (1024 nodes with 64 cores each). The results of our experiments are discussed in the context of their applicability to future HPC architectures. Beyond understanding performance, our work demonstrates that technologies such as Spark, while typically aimed at multi-tenant cloud-based environments, show promise for data analysis needs in a traditional clustering/supercomputing environment.
Comments: Submitted to IEEE Cloud 2019
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Cite as: arXiv:1904.11812 [cs.DC]
  (or arXiv:1904.11812v2 [cs.DC] for this version)
  https://doi.org/10.48550/arXiv.1904.11812
arXiv-issued DOI via DataCite

Submission history

From: George K. Thiruvathukal
[v1] Fri, 26 Apr 2019 12:52:02 UTC (2,615 KB)
[v2] Fri, 27 Sep 2019 22:26:39 UTC (2,634 KB)