System-level Scalable Checkpoint-Restart for Petascale Computing
Authors:
Jiajun Cao,
Kapil Arya,
Rohan Garg,
Shawn Matott,
Dhabaleswar K. Panda,
Hari Subramoni,
Jérôme Vienne,
Gene Cooperman
Abstract:
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that…
▽ More
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.
△ Less
Submitted 23 September, 2016; v1 submitted 27 July, 2016;
originally announced July 2016.
Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance
Authors:
S. Aseeri,
O. Batrašev,
M. Icardi,
B. Leu,
A. Liu,
N. Li,
B. K. Muite,
E. Müller,
B. Palen,
M. Quell,
H. Servat,
P. Sheth,
R. Speck,
M. Van Moer,
J. Vienne
Abstract:
The cubic Klein-Gordon equation is a simple but non-trivial partial differential equation whose numerical solution has the main building blocks required for the solution of many other partial differential equations. In this study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve the Klein-Gordon equation and strong scaling of the code is examined on thirteen different machines…
▽ More
The cubic Klein-Gordon equation is a simple but non-trivial partial differential equation whose numerical solution has the main building blocks required for the solution of many other partial differential equations. In this study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve the Klein-Gordon equation and strong scaling of the code is examined on thirteen different machines for a problem size of 512^3. The results are useful in assessing likely performance of other parallel fast Fourier transform based programs for solving partial differential equations. The problem is chosen to be large enough to solve on a workstation, yet also of interest to solve quickly on a supercomputer, in particular for parametric studies. Unlike other high performance computing benchmarks, for this problem size, the time to solution will not be improved by simply building a bigger supercomputer.
△ Less
Submitted 19 January, 2015;
originally announced January 2015.