-
A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures
Authors:
Semih Akkurt,
Sébastien Lemaire,
Paul Bartholomew,
Sylvain Laizet
Abstract:
Various numerical methods used for solving partial differential equations (PDEs) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward, and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce communication in distributed-memory environments. The underlying data structure plays a crucial role in the performance of the algorithm. First, the data structure improves data locality and makes it possible to minimise data movement via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available memory bandwidth. Finally, the data layout supports vectorisation on CPUs and thread-level parallelisation on GPUs for improved performance. In order to demonstrate the robustness of the algorithm, we implemented and benchmarked it on CPUs and GPUs. We investigated the single-rank performance and compared it against existing algorithms. Furthermore, we analysed the strong scaling of the implementation up to 384 NVIDIA H100 GPUs and up to 8192 AMD EPYC 7742 CPUs. Finally, we demonstrated a practical use case of the algorithm by using compact finite difference schemes to solve a 3D non-linear PDE. The results demonstrate that the DistD2-TDS algorithm can sustain around 66% of the theoretical peak bandwidth at scale on CPU- and GPU-based supercomputers.
Submitted 20 November, 2024;
originally announced November 2024.
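For readers unfamiliar with the tridiagonal systems this abstract refers to, the following is a minimal sketch of the classic serial Thomas algorithm. It is not the DistD2-TDS algorithm and says nothing about its data structure or communication pattern. The function name `thomas_solve` and the argument layout are assumptions made for illustration only.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    Illustrative sketch only (hypothetical helper, not from the paper).
    All four lists have length n; lower[0] and upper[n-1] are ignored.
    No pivoting is performed, which is safe for diagonally dominant systems.
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified super-diagonal
    d_prime = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Usage: a 4x4 diagonally dominant system (1, 4, 1 stencil).
solution = thomas_solve(
    lower=[0.0, 1.0, 1.0, 1.0],
    diag=[4.0, 4.0, 4.0, 4.0],
    upper=[1.0, 1.0, 1.0, 0.0],
    rhs=[6.0, 12.0, 18.0, 19.0],
)
```

Both sweeps carry a dependency from one row to the next, which is why splitting a single system across ranks requires communication, the cost the abstract says the distributed algorithm aims to reduce.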
-
PyFR v2.0.3: Towards Industrial Adoption of Scale-Resolving Simulations
Authors:
Freddie D. Witherden,
Peter E. Vincent,
Will Trojak,
Yoshiaki Abe,
Amir Akbarzadeh,
Semih Akkurt,
Mohammad Alhawwary,
Lidia Caros,
Tarik Dzanic,
Giorgio Giangaspero,
Arvind S. Iyer,
Antony Jameson,
Marius Koch,
Niki Loppi,
Sambit Mishra,
Rishit Modi,
Gonzalo Sáez-Mischlich,
Jin Seok Park,
Brian C. Vermeire,
Lai Wang
Abstract:
PyFR is an open-source cross-platform computational fluid dynamics framework based on the high-order Flux Reconstruction approach, specifically designed for undertaking high-accuracy scale-resolving simulations in the vicinity of complex engineering geometries. Since the initial release of PyFR v0.1.0 in 2013, a range of new capabilities have been added to the framework, with a view to enabling industrial adoption of the capability. This paper provides details of those enhancements as released in PyFR v2.0.3, explains efforts to grow an engaged developer and user community, and provides the latest performance and scaling results on up to 1024 AMD Instinct MI250X accelerators on Frontier at ORNL (each with two GCDs), and up to 2048 NVIDIA GH200 GPUs on Alps at CSCS.
Submitted 29 August, 2024;
originally announced August 2024.
-
Cache Blocking for Flux Reconstruction: Extension to Navier-Stokes Equations and Anti-aliasing
Authors:
Semih Akkurt,
Freddie Witherden,
Peter Vincent
Abstract:
In this article, cache blocking is implemented for the Navier-Stokes equations with anti-aliasing support on mixed grids in PyFR for CPUs. In particular, cache blocking is used as an alternative to kernel fusion to eliminate unnecessary data movement between kernels at the main memory level. Specifically, kernels that exchange data are grouped together, and these groups are then executed on small sub-regions of the domain that fit in the per-core private data cache. Additionally, cache blocking is used to efficiently implement a tensor product factorisation of the interpolation operators associated with anti-aliasing. By using cache blocking, the intermediate results between applications of the sparse factors are stored in the per-core private data cache, and a significant amount of data movement from main memory is avoided. In order to assess the performance gains, a theoretical model is developed, and the implementation is benchmarked using a compressible 3D Taylor-Green vortex test case on both hexahedral and prismatic grids, with third- and fourth-order solution polynomials. The expected performance gains based on the theoretical model range from 1.99 to 2.62, and the speedups obtained in practice range from 1.67 to 3.67 compared to PyFR v1.11.0.
Submitted 6 November, 2023;
originally announced January 2024.
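The cache blocking idea described in the abstract above can be sketched in a few lines. The two kernels, the function names, and the block size are hypothetical and are not taken from PyFR. Pure Python will not show the actual cache benefit, so this only illustrates the loop restructuring, assuming kernels that act independently on each element.

```python
def kernel_a(values):
    # Hypothetical first kernel: produces an intermediate result.
    return [2.0 * v + 1.0 for v in values]


def kernel_b(values):
    # Hypothetical second kernel: consumes the intermediate result.
    return [v * v for v in values]


def run_unblocked(field):
    # Each kernel sweeps the whole domain, so the full-size
    # intermediate array makes a round trip through main memory.
    intermediate = kernel_a(field)
    return kernel_b(intermediate)


def run_blocked(field, block_size=4):
    # Kernels that exchange data are grouped and run back to back on
    # one small sub-region at a time; the intermediate is only
    # block_size long and could stay in a per-core cache.
    result = []
    for start in range(0, len(field), block_size):
        block = field[start:start + block_size]
        result.extend(kernel_b(kernel_a(block)))
    return result


field = [float(i) for i in range(10)]
blocked = run_blocked(field, block_size=4)
unblocked = run_unblocked(field)
```

Both variants compute the same result; only the order of traversal and the size of the intermediate storage differ.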