-
Report on Challenges of Practical Reproducibility for Systems and HPC Computer Science
Authors:
Kate Keahey,
Marc Richardson,
Rafael Tolosana Calasanz,
Sascha Hunold,
Jay Lofstead,
Tanu Malik,
Christian Perez
Abstract:
This report synthesizes findings from the November 2024 Community Workshop on Practical Reproducibility in HPC, which convened researchers, artifact authors, reviewers, and chairs of reproducibility initiatives to address the critical challenge of making computational experiments reproducible in a cost-effective manner. The workshop deliberately focused on systems and HPC computer science research due to its unique requirements, including specialized hardware access and deep system reconfigurability. Through structured discussions, lightning talks, and panel sessions, participants identified key barriers to practical reproducibility and formulated actionable recommendations for the community.
The report presents a dual framework of challenges and recommendations organized by target audience (authors, reviewers, organizations, and community). It characterizes technical obstacles in experiment packaging and review, including completeness of artifact descriptions, acquisition of specialized hardware, and establishing reproducibility conditions. The recommendations range from immediate practical tools (comprehensive checklists for artifact packaging) to ecosystem-level improvements (refining badge systems, creating artifact digital libraries, and developing AI-assisted environment creation). Rather than advocating for reproducibility regardless of cost, the report emphasizes striking an appropriate balance between reproducibility rigor and practical feasibility, positioning reproducibility as an integral component of scientific exploration rather than a burdensome afterthought. Appendices provide detailed, immediately actionable checklists for authors and reviewers to improve reproducibility practices across the HPC community.
Submitted 2 May, 2025;
originally announced May 2025.
-
Performance Models for a Two-tiered Storage System
Authors:
Aparna Sasidharan,
Xian-He,
Jay Lofstead,
Scott Klasky
Abstract:
This work describes the design, implementation, and performance analysis of a distributed two-tiered storage software system. The first tier functions as a distributed software cache implemented using solid-state devices (NVMes), and the second tier consists of multiple hard disks (HDDs). We describe an online learning algorithm that manages data movement between the tiers. The software is hybrid, i.e., both distributed and multi-threaded. The end-to-end performance model of the two-tier system was developed using queuing networks and behavioral models of storage devices. We identified significant parameters that affect the performance of storage devices and created behavioral models for each device. The performance of the software was evaluated on a many-core cluster using non-trivial read/write workloads. The paper provides examples to illustrate the use of these models.
Submitted 11 March, 2025;
originally announced March 2025.
-
Exploring Spatial Indexing for Accelerated Feature Retrieval in HPC
Authors:
Margaret Lawson,
William Gropp,
Jay Lofstead
Abstract:
Despite the critical role that range queries play in analysis and visualization for HPC applications, there has been no comprehensive analysis of indices that are designed to accelerate range queries and the extent to which they are viable in an HPC setting. In this state-of-the-practice paper, we present the first such evaluation, examining 20 open-source C and C++ libraries that support range queries. Contributions of this paper include answering the following questions: which of the implementations are viable in an HPC setting; how do these libraries compare in terms of build time, query time, memory usage, and scalability; what are other trade-offs between these implementations; is there a single overall best solution; and when does a brute-force solution offer the best performance? We also share key insights learned during this process that can assist both HPC application scientists and spatial index developers.
Submitted 18 August, 2021; v1 submitted 26 June, 2021;
originally announced June 2021.
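As a point of reference for the brute-force baseline mentioned in the abstract above, the following is a minimal sketch of a linear-scan range query over 3D points. It is not taken from the paper or from any of the evaluated libraries; the function name `brute_force_range_query` and the sample data are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def brute_force_range_query(
    points: Sequence[Point], lo: Point, hi: Point
) -> List[int]:
    """Return indices of points inside the axis-aligned box [lo, hi], inclusive.

    Hypothetical helper: scans every point once, so query cost is linear
    in the number of points and no index has to be built beforehand.
    """
    hits = []
    for i, p in enumerate(points):
        # A point matches only if every coordinate lies within the bounds.
        if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
            hits.append(i)
    return hits


# Illustrative data only.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (5.0, 5.0, 5.0), (2.0, 2.0, 2.0)]
print(brute_force_range_query(points, lo=(0.5, 0.5, 0.5), hi=(3.0, 3.0, 3.0)))
# prints [1, 3]
```

A scan like this has no build cost, so it can plausibly compete with an index when only a few queries are issued per dataset; where that crossover lies is the kind of question the paper evaluates.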
-
Building Containerized Environments for Reproducibility and Traceability of Scientific Workflows
Authors:
Paula Olaya,
Jay Lofstead,
Michela Taufer
Abstract:
Scientists rely on simulations to study natural phenomena. Trusting the simulation results is vital to advancing science in any field. One approach to building trust is to ensure the reproducibility and traceability of the simulations through the annotation of executions at the system level and the generation of record trails of data moving through the simulation workflow. In this work, we present a system-level solution that leverages the intrinsic characteristics of containers (i.e., portability, isolation, encapsulation, and unique identifiers). Our solution consists of a containerized environment capable of annotating workflows, capturing provenance metadata, and building record trails. We assess our environment on four different workflows and measure containerization costs in terms of time and space. Our solution, built with a tolerable time and space overhead, enables transparent and automatic provenance metadata collection and access, an easy-to-read record trail, and tight connections between data and metadata.
Submitted 17 September, 2020;
originally announced September 2020.
-
Data Pallets: Containerizing Storage For Reproducibility and Traceability
Authors:
Jay Lofstead,
Joshua Baker,
Andrew Younge
Abstract:
Trusting simulation output is crucial for Sandia's mission objectives. We rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may have high-consequence results, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid in both automating simulation and modeling execution and determining exactly how some output was created so that conclusions can be drawn from the data.
Current approaches for workflows and provenance systems are all at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but its current implementation is focused solely on making it easy to deploy an application in an isolated "sandbox" and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage.
This project explores extending the container concept to include storage as a new container type we call data pallets. Data Pallets are potentially writeable, automatically generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it.
Submitted 7 November, 2018;
originally announced November 2018.
-
Distributed Versioned Object Storage -- Alternatives at the OSD layer (Poster Extended Abstract)
Authors:
Ivo Jimenez,
Carlos Maltzahn,
Jay Lofstead
Abstract:
The ability to store multiple versions of a data item is a powerful primitive that has had a wide variety of uses: relational databases, transactional memory, and version control systems, to name a few. However, each implementation uses a very particular form of versioning that is customized to the domain in question and hidden away from the user. In our ongoing project, we are reviewing and analyzing multiple uses of versioning in distinct domains, with the goal of identifying the basic components required to provide a generic distributed multiversioning object storage service and defining how these can be customized in order to serve distinct needs. With this primitive, new services can leverage multiversioning to ease development and provide specific consistency guarantees that address particular use cases. This work presents early results that quantify the trade-offs in implementing versioning at the local storage layer.
Submitted 14 June, 2014;
originally announced June 2014.
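To make the multiversioning primitive described in the abstract above concrete, here is a minimal in-memory sketch in which every write creates a new version and reads can target either the latest or a specific earlier version. This is an illustration under stated assumptions, not the authors' design: the class `VersionedStore` and its `put`/`get` methods are hypothetical names, and the sketch ignores distribution, persistence, and the OSD layer entirely.

```python
from typing import Any, Dict, List, Optional


class VersionedStore:
    """Hypothetical single-node store that keeps every version of each object."""

    def __init__(self) -> None:
        # Maps an object key to its append-only list of versions.
        self._versions: Dict[str, List[Any]] = {}

    def put(self, key: str, value: Any) -> int:
        """Append a new version of the object and return its version number."""
        history = self._versions.setdefault(key, [])
        history.append(value)
        return len(history) - 1

    def get(self, key: str, version: Optional[int] = None) -> Any:
        """Return the requested version, or the latest if none is given."""
        history = self._versions[key]  # raises KeyError for unknown keys
        if version is None:
            return history[-1]
        if not 0 <= version < len(history):
            raise KeyError(f"{key!r} has no version {version}")
        return history[version]


store = VersionedStore()
v0 = store.put("mesh", "initial")
v1 = store.put("mesh", "refined")
print(store.get("mesh"))      # prints refined
print(store.get("mesh", v0))  # prints initial
```

Keeping full copies of every version is the simplest possible policy; alternatives such as storing deltas trade space for read cost, which is the kind of trade-off the abstract says the work quantifies.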