Showing 1–2 of 2 results for author: Väisälä, M S

Search v0.5.6 released 2020-02-24

arXiv:2103.01597

cs.DC physics.comp-ph physics.flu-dyn

doi 10.1016/j.parco.2022.102904

Scalable communication for high-order stencil computations using CUDA-aware MPI

Authors: Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Abstract: Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated with the introduction of graphics processing units, which can provide by multiple factors higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. Our GPU implementation scales strongly from one to $64$ devices at $50\%$--$87\%$ of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits $20$--$60\times$ speedup and $9$--$12\times$ improved energy efficiency in compute-bound benchmarks on $16$ nodes.

Submitted 10 May, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: 15 pages, 15 figures. Updated with the accepted manuscript. More extensive tests added and wording clarified in several places. Please refer to the published article for the most polished version

Journal ref: Parallel Computing, Volume 111, 2022, 102904

arXiv:1707.08900

physics.comp-ph astro-ph.IM cs.DC physics.flu-dyn

doi 10.1016/j.cpc.2017.03.011

Methods for compressible fluid simulation on GPUs using high-order finite differences

Authors: Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Petri J. Käpylä, Omer Anjum

Abstract: We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates $343$ million grid points per second on a Tesla K40t GPU, achieving a $3.6 \times$ speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of $168$ million updates per second.

Submitted 27 July, 2017; originally announced July 2017.

Comments: 14 pages, 7 figures

Journal ref: Computer Physics Communications, Volume 217, August 2017, Pages 11-22
