-
GPU Algorithms for Efficient Exascale Discretizations
Authors:
Ahmad Abdelfattah,
Valeria Barra,
Natalie Beams,
Ryan Bleile,
Jed Brown,
Jean-Sylvain Camier,
Robert Carson,
Noel Chalmers,
Veselin Dobrev,
Yohann Dudouit,
Paul Fischer,
Ali Karakus,
Stefan Kerkemeier,
Tzanio Kolev,
Yu-Hsiang Lan,
Elia Merzari,
Misun Min,
Malachi Phillips,
Thilina Rathnayake,
Robert Rieben,
Thomas Stitt,
Ananias Tomboulides,
Stanimire Tomov,
Vladimir Tomov,
Arturo Vargas
, et al. (2 additional authors not shown)
Abstract:
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.
Submitted 10 September, 2021;
originally announced September 2021.
-
Efficient Exascale Discretizations: High-Order Finite Element Methods
Authors:
Tzanio Kolev,
Paul Fischer,
Misun Min,
Jack Dongarra,
Jed Brown,
Veselin Dobrev,
Tim Warburton,
Stanimire Tomov,
Mark S. Shephard,
Ahmad Abdelfattah,
Valeria Barra,
Natalie Beams,
Jean-Sylvain Camier,
Noel Chalmers,
Yohann Dudouit,
Ali Karakus,
Ian Karlin,
Stefan Kerkemeier,
Yu-Hsiang Lan,
David Medina,
Elia Merzari,
Aleksandr Obabko,
Will Pazner,
Thilina Rathnayake,
Cameron W. Smith
, et al. (5 additional authors not shown)
Abstract:
Efficient exploitation of exascale architectures requires rethinking the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra-fine-grain parallelism and maximize the ratio of floating-point operations to energy-intensive data movement. One of the few viable approaches to achieving high efficiency for PDE discretizations on unstructured grids is to use matrix-free/partially-assembled high-order finite element methods, since these methods can increase accuracy and/or lower computational time through reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries, and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, and high-order data visualization, and list examples of collaborations with several ECP and external applications.
Submitted 10 September, 2021;
originally announced September 2021.
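The matrix-free idea in the abstract above can be illustrated with a minimal sketch, assuming 1D linear (P1) elements for brevity; it is not CEED or libCEED code, and the function names are hypothetical. The operator is applied element by element through a local-to-global map (gather, local multiply, scatter-add), so only element-level data is touched and no global sparse matrix is ever stored; the assembled matrix is built here only as a reference for comparison.

```python
import numpy as np

def element_mass(h):
    """Exact mass matrix of a 1D linear (P1) element of length h."""
    return (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])

def apply_matrix_free(x, n_elem, h):
    """y = sum_e P_e^T M_e P_e x, evaluated element by element.

    No global matrix is formed: each element gathers its two dofs (P_e),
    applies the small local operator, and scatter-adds the result (P_e^T).
    """
    Me = element_mass(h)
    y = np.zeros_like(x)
    for e in range(n_elem):
        dofs = [e, e + 1]            # local-to-global map for element e
        y[dofs] += Me @ x[dofs]      # local action, then scatter-add
    return y

def assemble(n_elem, h):
    """Reference only: explicitly assembled global mass matrix."""
    Me = element_mass(h)
    M = np.zeros((n_elem + 1, n_elem + 1))
    for e in range(n_elem):
        dofs = [e, e + 1]
        M[np.ix_(dofs, dofs)] += Me
    return M

rng = np.random.default_rng(0)
x = rng.standard_normal(9)           # 8 elements on [0, 1] -> 9 nodes
y_mf = apply_matrix_free(x, 8, 0.125)
y_asm = assemble(8, 0.125) @ x
```

Both paths give the same result; the matrix-free one trades stored matrix entries for recomputation, which is the data-motion argument the abstract makes (CEED's high-order kernels additionally use tensor-product structure, not shown here).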
-
A Local Discontinuous Galerkin Level Set Reinitialization with Subcell Stabilization on Unstructured Meshes
Authors:
Ali Karakus,
Noel Chalmers,
Tim Warburton
Abstract:
In this paper we consider a level set reinitialization technique based on a high-order, local discontinuous Galerkin method on unstructured triangular meshes. A finite-volume-based subcell stabilization is used to improve the nonlinear stability of the method. Instead of the standard hyperbolic level set reinitialization, a time-dependent Eikonal equation is discretized to construct an approximate signed distance function. Using the Eikonal equation removes the regularization parameter required by the standard approach, which gives more predictable behavior and faster convergence around the interface. This makes our approach very efficient, especially for banded level set formulations. A set of numerical experiments including both smooth and non-smooth interfaces indicates that the method experimentally achieves the design order of accuracy.
Submitted 18 August, 2021;
originally announced August 2021.
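To make the reinitialization idea concrete, here is a minimal sketch of driving a level set toward a signed distance function by marching phi_t + S(phi0)(|phi_x| - 1) = 0 in pseudo-time. It deliberately swaps in a first-order Godunov upwind finite-difference scheme in 1D, not the paper's local discontinuous Galerkin method with subcell stabilization; the smoothed sign S = phi0 / sqrt(phi0^2 + dx^2), the CFL value, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def reinitialize_1d(phi0, dx, n_steps, cfl=0.5):
    """Pseudo-time march of phi_t + S(phi0) * (|phi_x| - 1) = 0 (1D sketch)."""
    # Smoothed sign of the initial level set, frozen during the iteration.
    S = phi0 / np.sqrt(phi0**2 + dx**2)
    phi = phi0.copy()
    dt = cfl * dx
    for _ in range(n_steps):
        p = np.pad(phi, 1, mode="edge")   # zero one-sided slope at the ends
        a = (p[1:-1] - p[:-2]) / dx       # backward difference D^- phi
        b = (p[2:] - p[1:-1]) / dx        # forward difference D^+ phi
        # Godunov upwind approximation of |phi_x|, selected by the sign of S.
        g_pos = np.sqrt(np.maximum(np.maximum(a, 0.0)**2, np.minimum(b, 0.0)**2))
        g_neg = np.sqrt(np.maximum(np.minimum(a, 0.0)**2, np.maximum(b, 0.0)**2))
        G = np.where(S > 0, g_pos, g_neg)
        phi = phi - dt * S * (G - 1.0)
    return phi

x = np.linspace(0.0, 1.0, 101)
dx = x[1] - x[0]
phi0 = (x - 0.3) * (1.0 + x)          # same zero set as x - 0.3, wrong slope
phi = reinitialize_1d(phi0, dx, 1000)
```

Information propagates outward from the interface along characteristics, so the zero level set stays put while the slope is corrected to 1 on both sides; this mirrors why the Eikonal formulation needs no regularization parameter, but it says nothing about the accuracy of the LDG discretization itself.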
-
Discontinuous Galerkin Discretizations of the Boltzmann Equations in 2D: semi-analytic time stepping and absorbing boundary layers
Authors:
A. Karakus,
N. Chalmers,
J. S. Hesthaven,
T. Warburton
Abstract:
We present an efficient nodal discontinuous Galerkin method for approximating nearly incompressible flows using the Boltzmann equations. The equations are discretized with Hermite polynomials in velocity space, yielding a first-order system of conservation laws. A stabilized unsplit perfectly matched layer (PML) formulation is introduced for the resulting nonlinear flow equations. The proposed PML equations exponentially absorb the difference between the nonlinear fluctuation and the prescribed mean flow. We introduce semi-analytic time discretization methods to relax the time-step restrictions imposed by small relaxation times. We also introduce a multirate semi-analytic Adams-Bashforth method that preserves efficiency in stiff regimes. The accuracy and performance of the method are assessed on several test cases, including an isothermal vortex, flow around a square cylinder, and flow around a wall-mounted square cylinder.
Submitted 5 May, 2018;
originally announced May 2018.
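The benefit of semi-analytic time stepping for small relaxation times can be sketched on a scalar model problem du/dt = -u/tau + f(u), where the stiff linear relaxation term is integrated exactly over each step. The sketch uses the first-order exponential time differencing (ETD1, exponential Euler) scheme as a stand-in; it is an assumption that this captures the spirit of the paper's semi-analytic methods, not a reproduction of them or of the multirate Adams-Bashforth variant, and the function names are hypothetical.

```python
import math

def etd1_step(u, dt, tau, forcing):
    """Exponential-Euler (ETD1) step for du/dt = -u/tau + forcing(u).

    The stiff linear relaxation term -u/tau is integrated exactly over the
    step; only forcing(u) is frozen at its value at the start of the step.
    """
    decay = math.exp(-dt / tau)
    # -expm1(-dt/tau) equals 1 - exp(-dt/tau), computed without cancellation.
    return decay * u + tau * (-math.expm1(-dt / tau)) * forcing(u)

def forward_euler_step(u, dt, tau, forcing):
    """Explicit Euler step; on this problem it is stable only for dt < 2*tau."""
    return u + dt * (-u / tau + forcing(u))

tau, dt, n_steps = 1.0e-3, 0.1, 10    # time step is 100x the relaxation time
u_etd = u_fe = 1.0
for _ in range(n_steps):
    u_etd = etd1_step(u_etd, dt, tau, lambda u: 1.0)
    u_fe = forward_euler_step(u_fe, dt, tau, lambda u: 1.0)
```

With a constant forcing, ETD1 reproduces the exact solution, which relaxes to tau; explicit Euler with the same step has amplification factor 1 - dt/tau = -99 and blows up. The step size is therefore limited by the non-stiff dynamics rather than by tau.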
-
A GPU Accelerated Discontinuous Galerkin Incompressible Flow Solver
Authors:
Ali Karakus,
Noel Chalmers,
Kasia Swirydowicz,
Timothy Warburton
Abstract:
We present a GPU-accelerated version of a high-order discontinuous Galerkin discretization of the unsteady incompressible Navier-Stokes equations. The equations are discretized in time using a semi-implicit scheme with explicit treatment of the nonlinear term and implicit treatment of the split Stokes operators. The pressure system is solved with a conjugate gradient method together with a fully GPU-accelerated multigrid preconditioner designed to minimize memory requirements and increase overall performance. A semi-Lagrangian subcycling advection algorithm shifts the computational load per timestep away from the pressure Poisson solve by allowing larger timestep sizes in exchange for an increased number of advection steps. Numerical results confirm that the method achieves the design order of accuracy in time and space. We optimize the performance of the most time-consuming kernels by tuning fine-grain parallelism and memory utilization and by maximizing bandwidth. To assess overall performance, we present an empirically calibrated roofline performance model for a target GPU to explain the achieved efficiency. We demonstrate that, in most cases, the kernels used in the solver are close to their empirically predicted roofline performance.
Submitted 7 May, 2018; v1 submitted 31 December, 2017;
originally announced January 2018.
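An empirically calibrated roofline bounds a kernel's attainable throughput by the smaller of peak compute and arithmetic intensity times achievable (measured) bandwidth. The sketch below is a minimal version of that bound; the peak and bandwidth values are hypothetical placeholders, not figures from the paper, and the function names are illustrative.

```python
def roofline_gflops(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s = min(peak compute, arithmetic intensity * bandwidth)."""
    intensity = flops / bytes_moved          # flop per byte
    return min(peak_gflops, intensity * bandwidth_gbs)

def roofline_efficiency(achieved_gflops, flops, bytes_moved,
                        peak_gflops, bandwidth_gbs):
    """Fraction of the roofline bound reached by a measured kernel."""
    bound = roofline_gflops(flops, bytes_moved, peak_gflops, bandwidth_gbs)
    return achieved_gflops / bound

# Hypothetical GPU: 7000 GFLOP/s FP64 peak, 800 GB/s measured bandwidth.
PEAK, BW = 7000.0, 800.0
n = 10**8
# Double-precision axpy y <- a*x + y: 2 flops and 3 * 8 bytes per entry.
axpy_bound = roofline_gflops(2 * n, 24 * n, PEAK, BW)
```

Low-intensity kernels like this axpy (1/12 flop per byte) sit on the bandwidth slope, which is why the abstract emphasizes memory utilization and bandwidth rather than raw flop rate; "empirically calibrated" means BW is a measured rather than a datasheet value.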
-
Acceleration of tensor-product operations for high-order finite element methods
Authors:
Kasia Świrydowicz,
Noel Chalmers,
Ali Karakus,
Timothy Warburton
Abstract:
This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide the mathematical background for these operations and implementation details. Achieving close-to-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies and present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.
Submitted 13 November, 2017; v1 submitted 2 November, 2017;
originally announced November 2017.
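The tiered structure of tensor-product operators mentioned above can be seen in a minimal sketch of 3D interpolation on a hexahedral element, assuming a 1D interpolation matrix B of shape (q, p) mapping p nodal values to q quadrature points per direction. Applying the full Kronecker operator costs q^3 p^3 multiply-adds, while contracting one direction at a time (sum factorization) costs q p^3 + q^2 p^2 + q^3 p and produces the intermediate results that a GPU kernel must keep on chip. The random B and function names are illustrative only, not the kernels from the paper.

```python
import numpy as np

def interp_full(B, u):
    """Apply the full (q^3 x p^3) operator kron(B, kron(B, B)) to u.ravel()."""
    return np.kron(np.kron(B, B), B) @ u.ravel()

def interp_sum_factorized(B, u):
    """Same result via three successive 1D contractions (sum factorization)."""
    v = np.einsum("ia,abc->ibc", B, u)   # contract the first direction
    v = np.einsum("jb,ibc->ijc", B, v)   # then the second
    v = np.einsum("kc,ijc->ijk", B, v)   # then the third
    return v.ravel()

rng = np.random.default_rng(1)
p, q = 4, 5                            # nodes and quadrature points per direction
B = rng.standard_normal((q, p))
u = rng.standard_normal((p, p, p))     # nodal values on one element
full = interp_full(B, u)
fact = interp_sum_factorized(B, u)
```

Each contraction has low arithmetic intensity on its own, and the two intermediate arrays v are exactly the "intermediate results inside the kernel" whose placement (registers vs. shared memory) drives the optimizations the paper discusses.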