The Artificial Scientist -- in-transit Machine Learning of Plasma Simulations
Authors:
Jeffrey Kelling,
Vicente Bolea,
Michael Bussmann,
Ankush Checkervarty,
Alexander Debus,
Jan Ebert,
Greg Eisenhauer,
Vineeth Gutta,
Stefan Kesselheim,
Scott Klasky,
Richard Pausch,
Norbert Podhorszki,
Franz Poschel,
David Rogers,
Jeyhun Rustamov,
Steve Schmerler,
Ulrich Schramm,
Klaus Steiniger,
Rene Widera,
Anna Willmann,
Sunita Chandrasekaran
Abstract:
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof of concept, we consider a GPU-accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to the Frontier exascale system.
Submitted 15 January, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
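Illustrative note: the experience replay mentioned in the abstract above can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. The class `ReplayBuffer`, the helper `mixed_batch`, the reservoir-sampling replacement policy, and all parameter names are assumptions introduced here for illustration; the abstract does not specify how the buffer is filled or sampled.

```python
import random


class ReplayBuffer:
    """Fixed-capacity buffer of past samples (hypothetical helper).

    Uses reservoir sampling, an assumed policy, so that every sample
    seen so far has the same chance of being in the buffer.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered to the buffer
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        # Draw up to k stored samples without replacement.
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_samples, buffer, n_replay):
    """Return new samples plus replayed old ones, then store the new ones."""
    replayed = buffer.sample(n_replay)
    for s in new_samples:
        buffer.add(s)
    return list(new_samples) + replayed


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=4)
    # Each list stands in for data arriving from one simulation step.
    for step_data in ([0, 1, 2], [3, 4, 5], [6, 7, 8]):
        batch = mixed_batch(step_data, buf, n_replay=2)
        print(batch)
```

Mixing replayed samples into each training batch means the model keeps seeing earlier phases of a non-steady process while new data streams in, which is the general idea behind using replay against catastrophic forgetting.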
Streaming Data in HPC Workflows Using ADIOS
Authors:
Greg Eisenhauer,
Norbert Podhorszki,
Ana Gainaru,
Scott Klasky,
Philip E. Davis,
Manish Parashar,
Matthew Wolf,
Eric Suchtya,
Erick Fredj,
Vicente Bolea,
Franz Pöschel,
Klaus Steiniger,
Michael Bussmann,
Richard Pausch,
Sunita Chandrasekaran
Abstract:
The "IO Wall" problem, in which the gap between computation rate and data access rate grows continuously, poses significant problems to scientific workflows that have traditionally relied on the filesystem for intermediate storage between workflow stages. One way to avoid this problem in scientific workflows is to stream data directly from producers to consumers and avoid storage entirely. However, the manner in which this is accomplished is key to both performance and usability. This paper presents the Sustainable Staging Transport (SST), an approach which allows direct streaming between traditional file writers and readers with few application changes. SST is an ADIOS "engine", accessible via standard ADIOS APIs, and because ADIOS allows engines to be chosen at run-time, many existing file-oriented ADIOS workflows can utilize SST for direct application-to-application communication without any source code changes. This paper describes the design of SST and presents performance results from various applications that use SST: for feeding model training with simulation data with substantially higher bandwidth than the theoretical limits of Frontier's file system, for strong coupling of separately developed applications for multiphysics multiscale simulation, and for in situ analysis and visualization of data to complete all data processing shortly after the simulation finishes.
Submitted 30 September, 2024;
originally announced October 2024.