Search | arXiv e-print repository

arXiv:2504.06699 [pdf, other]

Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset

Authors: Sam Jacob Jacob, Markus Mrosek, Carsten Othmer, Harald Köstler

Abstract: Aerodynamic optimization is crucial for developing eco-friendly, aerodynamic, and stylish cars, which requires close collaboration between aerodynamicists and stylists, a collaboration impaired by the time-consuming nature of aerodynamic simulations. Surrogate models offer a viable solution to reduce this overhead, but they are untested in real-world aerodynamic datasets. We present a comparative evaluation of two surrogate modeling approaches for predicting drag on a real-world dataset: a Convolutional Neural Network (CNN) model that uses a signed distance field as input and a commercial tool based on Graph Neural Networks (GNN) that directly processes a surface mesh. In contrast to previous studies based on datasets created from parameterized geometries, our dataset comprises 343 geometries derived from 32 baseline vehicle geometries across five distinct car projects, reflecting the diverse, free-form modifications encountered in the typical vehicle development process. Our results show that the CNN-based method achieves a mean absolute error of 2.3 drag counts, while the GNN-based method achieves 3.8. Both methods achieve approximately 77% accuracy in predicting the direction of drag change relative to the baseline geometry. While both methods effectively capture the broader trends between baseline groups (set of samples derived from a single baseline geometry), they struggle to varying extents in capturing the finer intra-baseline group variations.
In summary, our findings suggest that aerodynamicists can effectively use both methods to predict drag in under two minutes, which is at least 600 times faster than performing a simulation. However, there remains room for improvement in capturing the finer details of the geometry.

Submitted 9 April, 2025; originally announced April 2025.
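The two evaluation metrics quoted in the abstract above (mean absolute error in drag counts, and accuracy in predicting the direction of drag change relative to the baseline) can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so the sign-based comparison, the function names, and all drag values below are assumptions chosen purely for illustration, not data or code from the paper.

```python
from statistics import mean


def mean_absolute_error(predicted, simulated):
    """Average absolute difference between predicted and simulated drag (in drag counts)."""
    return mean(abs(p - s) for p, s in zip(predicted, simulated))


def direction_accuracy(predicted, simulated, baseline):
    """Fraction of samples where the predicted drag change relative to the
    baseline has the same sign as the simulated change.

    Assumption: a change of exactly zero is treated as "not an increase".
    """
    hits = [
        (p - baseline > 0) == (s - baseline > 0)
        for p, s in zip(predicted, simulated)
    ]
    return sum(hits) / len(hits)


# Hypothetical values in drag counts for one baseline group (not from the paper).
baseline_drag = 300
simulated = [302, 297, 305, 299]
predicted = [301, 299, 303, 301]

mae = mean_absolute_error(predicted, simulated)
acc = direction_accuracy(predicted, simulated, baseline_drag)
print(f"MAE: {mae} drag counts, direction accuracy: {acc:.0%}")
```

In this made-up group the last sample is predicted as a drag increase although the simulated drag decreases, which is the kind of intra-baseline miss the direction metric is meant to expose.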

arXiv:2502.20049 [pdf, other]

Large-Scale Simulations of Fully Resolved Complex Moving Geometries with Partially Saturated Cells

Authors: P. Suffa, S. Kemmler, H. Koestler, U. Ruede

Abstract: We employ the Partially Saturated Cells Method (PSM) to model the interaction between the fluid flow and solid moving objects as an extension to the conventional lattice Boltzmann method. We introduce an efficient and accurate method for mapping complex moving geometries onto uniform Cartesian grids suitable for massively parallel processing. A validation of the physical accuracy of the solid-fluid coupling and the proposed mapping of complex geometries is presented. The implementation is integrated into the code generation pipeline of the waLBerla framework so that highly optimized kernels for CPU and GPU architectures become available. We study the node-level performance of the automatically generated solver routines. 71% of the peak performance can be achieved on CPU nodes and 86% on GPU accelerated nodes. Only a moderate overhead is observed for the processing of the solid-fluid coupling when compared to the fluid simulations without moving objects. Finally, a counter-rotating rotor is presented as a prototype industrial scenario, resulting in a mesh size involving up to 4.3 billion fluid grid cells. For this scenario, excellent parallel efficiency is reported in a strong scaling study on up to 32,768 CPU cores on the LUMI-C supercomputer and on up to 1,024 NVIDIA A100 GPUs on the JUWELS Booster system.

Submitted 27 February, 2025; originally announced February 2025.

Comments: 13 pages, 16 figures

arXiv:2412.08186 [pdf, other]

Towards Automated Algebraic Multigrid Preconditioner Design Using Genetic Programming for Large-Scale Laser Beam Welding Simulations

Authors: Dinesh Parthasarathy, Tommaso Bevilacqua, Martin Lanser, Axel Klawonn, Harald Köstler

Abstract: Multigrid methods are asymptotically optimal algorithms ideal for large-scale simulations. However, they require making numerous algorithmic choices that significantly influence their efficiency. Unlike recent approaches that learn optimal multigrid components using machine learning techniques, we adopt a complementary strategy here, employing evolutionary algorithms to construct efficient multigrid cycles from available individual components. This technology is applied to finite element simulations of the laser beam welding process. The thermo-elastic behavior is described by a coupled system of time-dependent thermo-elasticity equations, leading to nonlinear and ill-conditioned systems. The nonlinearity is addressed using Newton's method, and iterative solvers are accelerated with an algebraic multigrid (AMG) preconditioner using hypre BoomerAMG interfaced via PETSc. This is applied as a monolithic solver for the coupled equations. To further enhance solver efficiency, flexible AMG cycles are introduced, extending traditional cycle types with level-specific smoothing sequences and non-recursive cycling patterns. These are automatically generated using genetic programming, guided by a context-free grammar containing AMG rules. Numerical experiments demonstrate the potential of these approaches to improve solver performance in large-scale laser beam welding simulations.

Submitted 11 December, 2024; originally announced December 2024.

MSC Class: 65M55 (Primary) 74F05; 65M60 (Secondary)

ACM Class: I.2.2; G.1.8; J.2

arXiv:2412.05852 [pdf, other]

Evolving Algebraic Multigrid Methods Using Grammar-Guided Genetic Programming

Authors: Dinesh Parthasarathy, Wayne Bradford Mitchell, Harald Köstler

Abstract: Multigrid methods, despite being known to be asymptotically optimal algorithms, depend on the careful selection of their individual components for efficiency. Also, they are mostly restricted to standard cycle types like V-, F-, and W-cycles. We use grammar rules to generate arbitrary-shaped cycles, wherein the smoothers and their relaxation weights are chosen independently at each step within the cycle. We call this a flexible multigrid cycle. These flexible cycles are used in Algebraic Multigrid (AMG) methods with the help of grammar rules and optimized using genetic programming. The flexible AMG methods are implemented in the software library of hypre, and the programs are optimized separately for two cases: a standalone AMG solver for a 3D anisotropic problem and an AMG preconditioner with conjugate gradient for a multiphysics code. We observe that the optimized flexible cycles provide higher efficiency and better performance than the standard cycle types.

Submitted 8 December, 2024; originally announced December 2024.

arXiv:2409.07203 [pdf]

Phantom-based gradient waveform measurements with compensated variable-prephasing: Description and application to EPI at 7T

Authors: Hannah Scholten, Tobias Wech, Istvan Homolya, Herbert Köstler

Abstract: Purpose: Introducing "compensated variable-prephasing" (CVP), a phantom-based method for gradient waveform measurements. The technique is based on the "variable-prephasing" (VP) method, but takes into account the effects of all gradients involved in the measurement. Methods: We conducted measurements of a trapezoidal test gradient, and of an EPI readout gradient train with three approaches: VP, CVP, and "fully compensated variable-prephasing" (FCVP). We compared them to one another and to predictions based on the gradient system transfer function. Furthermore, we used the measured and predicted EPI gradients for trajectory corrections in phantom images on a 7T scanner. Results: The VP gradient measurements are confounded by lingering oscillations of the prephasing gradients, which are compensated in the CVP and FCVP measurements. FCVP is vulnerable to a sign asymmetry in the gradient chain. However, the trajectories determined by all three methods resulted in comparably high EPI image quality. Conclusion: We present a new approach allowing for phantom-based gradient waveform measurements with high precision, which can be useful for trajectory corrections in non-Cartesian or single-shot imaging techniques. In our experimental setup, the proposed "compensated variable-prephasing" method provided the most reliable gradient measurements of the different techniques we compared.

Submitted 11 September, 2024; originally announced September 2024.

Comments: 24 pages, 5 figures

arXiv:2408.06880 [pdf, other]

Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries

Authors: Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde

Abstract: We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as in porous media flows. The data structure is integrated into a code generation pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils and collision operators and to generate efficient code for kernels for CPU as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to save memory accesses and memory consumption, and we implement a communication hiding technique to prove scalability. We present single GPU performance results with up to 99% of maximal bandwidth utilization. We integrate the optimized generated kernels in the high performance framework waLBerla and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three different applications to test the sparse data structure for realistic demonstrator problems. We show performance results for flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. We achieve a maximal performance speed-up of 2 and a significantly reduced memory consumption by up to 75% with the sparse / indirect-addressing data structure compared to the direct-addressing data structure for these applications.

Submitted 13 August, 2024; originally announced August 2024.

Comments: 16 pages, 19 figures

arXiv:2404.08371 [pdf, other]

Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids

Authors: Fabian Böhm, Daniel Bauer, Nils Kohl, Christie Alappat, Dominik Thönnes, Marcus Mohr, Harald Köstler, Ulrich Rüde

Abstract: This paper introduces a code generator designed for node-level optimized, extreme-scalable, matrix-free finite element operators on hybrid tetrahedral grids. It optimizes the local evaluation of bilinear forms through various techniques including tabulation, relocation of loop invariants, and inter-element vectorization - implemented as transformations of an abstract syntax tree. A key contribution is the development, analysis, and generation of efficient loop patterns that leverage the local structure of the underlying tetrahedral grid. These significantly enhance cache locality and arithmetic intensity, mitigating bandwidth-pressure associated with compute-sparse, low-order operators. The paper demonstrates the generator's capabilities through a comprehensive educational cycle of performance analysis, bottleneck identification, and emission of dedicated optimizations. For three differential operators ($-\Delta$, $-\nabla \cdot (k(\mathbf{x})\, \nabla\,)$, $\alpha(\mathbf{x})\, \mathbf{curl}\ \mathbf{curl} + \beta(\mathbf{x})$), we determine the set of most effective optimizations. Applied by the generator, they result in speed-ups of up to 58$\times$ compared to reference implementations. Detailed node-level performance analysis yields matrix-free operators with a throughput of 1.3 to 2.1 GDoF/s, achieving up to 62% peak performance on a 36-core Intel Ice Lake socket.
Finally, the solution of the curl-curl problem with more than a trillion ($10^{12}$) degrees of freedom on 21504 processes in less than 50 seconds demonstrates the generated operators' performance and extreme-scalability as part of a full multigrid solver.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 22 pages

MSC Class: 65F50; 65N30; 65N55; 65Y20; 65F10

arXiv:2403.08063 [pdf, other]

Towards Code Generation for Octree-Based Multigrid Solvers

Authors: Richard Angersbach, Sebastian Kuckuck, Harald Köstler

Abstract: This paper presents a novel method designed to generate multigrid solvers optimized for octree-based software frameworks. Our approach focuses on accurately capturing local features within a domain while leveraging the efficiency inherent in multigrid techniques. We outline the essential steps involved in generating specialized kernels for local refinement and communication routines, integrating on-the-fly interpolations to seamlessly transfer information between refinement levels. For this purpose, we established a software coupling via an automatic fusion of generated multigrid solvers and communication kernels with manual implementations of complex octree data structures and algorithms often found in established software frameworks. We demonstrate the effectiveness of our method through numerical experiments with different interpolation orders. Large-scale benchmarks conducted on the SuperMUC-NG CPU cluster underscore the advantages of our approach, offering a comparison against a reference implementation to highlight the benefits of our method and code generation in general.

Submitted 6 May, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.01579 [pdf, other]

doi 10.1080/17445760.2024.2360190

A Continuous Benchmarking Infrastructure for High-Performance Computing Applications

Authors: Christoph Alt, Martin Lanser, Jonas Plewinski, Atin Janki, Axel Klawonn, Harald Köstler, Michael Selzer, Ulrich Rüde

Abstract: For scientific software, especially those used for large-scale simulations, achieving good performance and efficiently using the available hardware resources is essential. It is important to regularly perform benchmarks to ensure the efficient use of hardware and software when systems are changing and the software evolves. However, this can quickly become very tedious when many options for parameters, solvers, and hardware architectures are available. We present a continuous benchmarking strategy that automates benchmarking new code changes on high-performance computing clusters. This makes it possible to track how each code change affects the performance and how it evolves.

Submitted 3 March, 2024; originally announced March 2024.

Journal ref: International Journal of Parallel, Emergent & Distributed Systems, 2024

arXiv:2402.13171 [pdf, other]

doi 10.1002/cpe.8117

waLBerla-wind: a lattice-Boltzmann-based high-performance flow solver for wind energy applications

Authors: Helen Schottenhamml, Ani Anciaux-Sedrakian, Frédéric Blondel, Harald Köstler, Ulrich Rüde

Abstract: This article presents the development of a new wind turbine simulation software to study wake flow physics. To this end, the design and development of waLBerla-wind, a new simulator based on the lattice-Boltzmann method that is known for its excellent performance and scaling properties, will be presented. Here it will be used for large eddy simulations (LES) coupled with actuator wind turbine models. Due to its modular software design, waLBerla-wind is flexible and extensible with regard to turbine configurations. Additionally, it is performance portable across different hardware architectures, another critical design goal. The new solver is validated by presenting force distributions and velocity profiles and comparing them with experimental data and a vortex solver. Furthermore, waLBerla-wind's performance is compared to a theoretical peak performance, and analysed with weak and strong scaling benchmarks on CPU and GPU systems. This analysis demonstrates the suitability for large-scale applications and future cost-effective full wind farm simulations.

Submitted 8 December, 2023; originally announced February 2024.

Journal ref: Concurrency Computat Pract Exper. 2024;e8117

arXiv:2311.11348 [pdf, other]

p-adaptive discontinuous Galerkin method for the shallow water equations on heterogeneous computing architectures

Authors: Sara Faghih-Naini, Vadym Aizinger, Sebastian Kuckuk, Richard Angersbach, Harald Köstler

Abstract: Heterogeneous computing and exploiting integrated CPU-GPU architectures have become a clear current trend since the flattening of Moore's Law. In this work, we propose a numerical and algorithmic re-design of a p-adaptive quadrature-free discontinuous Galerkin method (DG) for the shallow water equations (SWE). Our new approach separates the computations of the non-adaptive (lower-order) and adaptive (higher-order) parts of the discretization from each other. Thereby, we can overlap computations of the lower-order and the higher-order DG solution components. Furthermore, we investigate execution times of main computational kernels and use automatic code generation to optimize their distribution between the CPU and GPU. Several setups, including a prototype of a tsunami simulation in a tide-driven flow scenario, are investigated, and the results show that significant performance improvements can be achieved in suitable setups.

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2307.01594 [pdf]

A pre-emphasis based on the gradient system transfer function reduces steady-state disruptions in bSSFP imaging caused by residual gradients

Authors: Hannah Scholten, Herbert Köstler, Anne Slawig

Abstract: Purpose: To examine whether an advanced gradient pre-emphasis approach based on the gradient system transfer function (GSTF) can mitigate artifacts caused by residual unbalanced gradients in Cartesian balanced steady-state free precession (bSSFP) imaging with non-linear line-ordering. Theory and Methods: We implemented a gradient pre-emphasis based on the GSTF for bSSFP sequences with linear, centric and quasi-random ordering of the phase-encoding steps. Signal-, noise- and artifact levels were determined in phantom experiments. Furthermore, we simulated the phase accumulating in every TR interval of a Cartesian bSSFP sequence for the three different line-ordering schemes. Results: The simulations showed that the phase contributions arising from residual unbalanced phase-encoding gradients are the principal cause of steady-state disruptions in our sequence. In the phantom experiments, the GSTF-based gradient pre-emphasis approach reduced the artifact level in bSSFP images with non-linear line-ordering considerably. Compared to the linearly ordered measurement, the relative artifact intensity difference dropped by up to 89%. Conclusion: A GSTF-based pre-emphasis approach can successfully mitigate residual unbalanced gradient artifacts in bSSFP imaging with non-linear line-ordering.

Submitted 4 July, 2023; originally announced July 2023.

Comments: 20 pages with 6 figures

arXiv:2306.10080 [pdf, ps, other]

AI Driven Near Real-time Locational Marginal Pricing Method: A Feasibility and Robustness Study

Authors: Naga Venkata Sai Jitin Jami, Juraj Kardoš, Olaf Schenk, Harald Köstler

Abstract: Accurate price predictions are essential for market participants in order to optimize their operational schedules and bidding strategies, especially in the current context where electricity prices become more volatile and less predictable using classical approaches. The Locational Marginal Pricing (LMP) mechanism is used in many modern power markets, where the traditional approach utilizes optimal power flow (OPF) solvers. However, for large electricity grids this process becomes prohibitively time-consuming and computationally intensive. Machine learning (ML) based predictions could provide an efficient tool for LMP prediction, especially in energy markets with intermittent sources like renewable energy. This study evaluates the performance of popular machine learning and deep learning models in predicting LMP on multiple electricity grids. The accuracy and robustness of these models in predicting LMP are assessed considering multiple scenarios. The results show that ML models can predict LMP 4-5 orders of magnitude faster than traditional OPF solvers with a 5-6% error rate, highlighting the potential of ML models in LMP prediction for large-scale power models with the assistance of hardware infrastructure like multi-core CPUs and GPUs in modern HPC clusters.

Submitted 2 October, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

arXiv:2303.11811 [pdf, other]

Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures

Authors: Samuel Kemmler, Christoph Rettinger, Ulrich Rüde, Pablo Cuéllar, Harald Köstler

Abstract: Current supercomputers often have a heterogeneous architecture using both CPUs and GPUs. At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware due to multiple reasons, e.g., architectural requirements, pragmatism, etc. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow utilizes GPUs, while the Lagrangian model for the particles runs on CPUs. First, a roofline model is employed to predict the node level performance and to show that the lattice-Boltzmann-based fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication for a large-scale flow simulation results in only moderate slowdowns due to the efficiency of the CUDA-aware MPI communication, combined with communication hiding techniques. On 1024 A100 GPUs, a parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communications that can become a bottleneck. Additionally, special attention is paid to the CPU-GPU communication overhead since this is essential for coupling the particles to the flow simulation.
However, thanks to our problem-aware co-partitioning, the CPU-GPU communication overhead is found to be negligible. As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers. Additionally, an a priori estimate of the speedup for hybrid implementations is suggested.

Submitted 9 December, 2024; v1 submitted 21 March, 2023; originally announced March 2023.
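The abstract above mentions a roofline model for predicting node-level performance. As a rough illustration of the general roofline idea (not the authors' specific model), attainable performance is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The hardware figures in this Python sketch are hypothetical placeholders, not measurements of any real device.

```python
def roofline(peak_flops, bandwidth_bytes_per_s, intensity_flops_per_byte):
    """Attainable performance (FLOP/s) under the basic roofline model:
    the minimum of the compute ceiling and the memory ceiling."""
    memory_ceiling = bandwidth_bytes_per_s * intensity_flops_per_byte
    return min(peak_flops, memory_ceiling)


# Hypothetical accelerator (placeholder numbers, not a real device).
PEAK_FLOPS = 10e12   # 10 TFLOP/s compute peak
BANDWIDTH = 1.5e12   # 1.5 TB/s memory bandwidth

# A low-intensity kernel (2 FLOP per byte moved) is limited by memory traffic.
low = roofline(PEAK_FLOPS, BANDWIDTH, 2.0)
# A high-intensity kernel (10 FLOP per byte moved) hits the compute ceiling.
high = roofline(PEAK_FLOPS, BANDWIDTH, 10.0)

print(f"low-intensity kernel:  {low / 1e12:.1f} TFLOP/s (memory-bound)")
print(f"high-intensity kernel: {high / 1e12:.1f} TFLOP/s (compute-bound)")
```

Comparing a kernel's measured throughput against such a ceiling is one common way to judge how close an implementation is to the limit of the hardware.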

arXiv:2302.14660 [pdf, other]

MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages

Authors: Rafael Ravedutti Lucio Machado, Jan Eitzinger, Jan Laukemann, Georg Hager, Harald Köstler, Gerhard Wellein

Abstract: Molecular dynamics (MD) simulations provide considerable benefits for the investigation and experimentation of systems at atomic level. Their usage is widespread across several research fields, but their system size and timescale are also crucially limited by the computing power they can make use of. Performance engineering of MD kernels is therefore important to understand their bottlenecks and point out possible improvements. For that reason, we developed MD-Bench, a proxy-app for short-range MD kernels that implements state-of-the-art algorithms from multiple production applications such as LAMMPS and GROMACS. MD-Bench is intended to have simpler, understandable and extensible source code, as well as to be transparent and suitable for teaching, benchmarking and researching MD algorithms. In this paper we introduce MD-Bench, describe its design and structure and implemented algorithms. Finally, we show five usage examples of MD-Bench and describe how these are useful to have a deeper understanding of MD kernels from a performance point of view, also exposing some interesting performance insights.

Submitted 22 February, 2023; originally announced February 2023.

Comments: 17 pages, 10 figures, 5 tables. arXiv admin note: text overlap with arXiv:2207.13094

arXiv:2301.10674 [pdf, other]

doi 10.1017/jfm.2023.262

Particle-resolved simulation of antidunes in free-surface flows

Authors: Christoph Schwarzmeier, Christoph Rettinger, Samuel Kemmler, Jonas Plewinski, Francisco Núñez-González, Harald Köstler, Ulrich Rüde, Bernhard Vowinckel

Abstract: The interaction of supercritical turbulent flows with granular sediment beds is challenging to study both experimentally and numerically; this challenging task has hampered the advances in understanding antidunes, the most characteristic bedform of supercritical flows. This article presents the first numerical attempt to simulate upstream-migrating antidunes with geometrically resolved particles and a liquid-gas interface. Our simulations provide data at a resolution higher than laboratory experiments, and they can therefore provide new insights into the mechanisms of antidune migration and contribute to a deeper understanding of the underlying physics. To manage the simulations' computational costs and physical complexity, we employ the cumulant lattice Boltzmann method in conjunction with a discrete element method for particle interactions, as well as a volume of fluid scheme to track the deformable free surface of the fluid. By reproducing two flow configurations of previous experiments (Pascal et al., Earth Surf. Proc. Land., vol. 46(9), 2021, 1750-1765), we demonstrate that our approach is robust and accurately predicts the antidunes' amplitude, wavelength, and celerity. Furthermore, the simulated wall-shear stress, a key parameter governing sediment transport, is in excellent agreement with the experimental measurements. The highly resolved data of fluid and particle motion from our simulation approach open new perspectives for detailed studies of morphodynamics in shallow supercritical flows.

Submitted 23 March, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

Journal ref: Journal of Fluid Mechanics 961 (2023)

arXiv:2207.13094 [pdf, other]

MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

Authors: Rafael Ravedutti Lucio Machado, Jan Eitzinger, Harald Köstler, Gerhard Wellein

Abstract: Proxy-apps, or mini-apps, are simple self-contained benchmark codes with performance-relevant kernels extracted from real applications. Initially used to facilitate software-hardware co-design, they are a crucial ingredient for serious performance engineering, especially when dealing with large-scale production codes. MD-Bench is a new proxy-app in the area of classical short-range molecular dynamics. In contrast to existing proxy-apps in MD (e.g. miniMD and coMD) it does not resemble a single application code, but implements state-of-the-art algorithms from multiple applications (currently LAMMPS and GROMACS). The MD-Bench source code is understandable, extensible and suited for teaching, benchmarking and researching MD algorithms. Primary design goals are transparency and simplicity; a developer is able to tinker with the source code down to the assembly level. This paper introduces MD-Bench, explains its design and structure, covers implemented optimization variants, and illustrates its usage on three examples.

Submitted 26 July, 2022; originally announced July 2022.

Comments: 12 Pages, 2 figures, submitted to PPAM22

arXiv:2204.12846 [pdf, other]

doi 10.1145/3512290.3528688

Evolving Generalizable Multigrid-Based Helmholtz Preconditioners with Grammar-Guided Genetic Programming

Authors: Jonas Schmitt, Harald Köstler

Abstract: Solving the indefinite Helmholtz equation is not only crucial for the understanding of many physical phenomena but also represents an outstandingly-difficult benchmark problem for the successful application of numerical methods. Here we introduce a new approach for evolving efficient preconditioned iterative solvers for Helmholtz problems with multi-objective grammar-guided genetic programming. Our approach is based on a novel context-free grammar, which enables the construction of multigrid preconditioners that employ a tailored sequence of operations on each discretization level. To find solvers that generalize well over the given domain, we propose a custom method of successive problem difficulty adaption, in which we evaluate a preconditioner's efficiency on increasingly ill-conditioned problem instances. We demonstrate our approach's effectiveness by evolving multigrid-based preconditioners for a two-dimensional indefinite Helmholtz problem that outperform several human-designed methods for different wavenumbers up to systems of linear equations with more than a million unknowns.

Submitted 28 April, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

Journal ref: Proceedings of the 2022 Genetic and Evolutionary Computation Conference (Boston, USA) (GECCO '22)

arXiv:2108.05798 [pdf]

doi 10.4271/15-15-02-0006

Deep Learning for Real-Time Aerodynamic Evaluations of Arbitrary Vehicle Shapes

Authors: Sam Jacob Jacob, Markus Mrosek, Carsten Othmer, Harald Köstler

Abstract: The aerodynamic optimization process of cars requires multiple iterations between aerodynamicists and stylists. Response Surface Modeling and Reduced-Order Modeling are commonly used to eliminate the overhead due to Computational Fluid Dynamics, leading to faster iterations. However, a primary drawback of these models is that they can work only on the parametrized geometric features they were trained with. This study evaluates if deep learning models can predict the drag coefficient for an arbitrary input geometry without explicit parameterization. We use two similar data sets based on the publicly available DrivAer geometry for training. We use a modified U-Net architecture that uses Signed Distance Fields to represent the input geometries. Our models outperform the existing models by at least 11% in prediction accuracy for the drag coefficient. We achieved this improvement by combining multiple data sets that were created using different parameterizations, which is not possible with the methods currently used. We have also shown that it is possible to predict velocity fields and drag coefficient concurrently and that simple data augmentation methods can improve the results. In addition, we have performed an occlusion sensitivity study on our models to understand what information is used to make predictions. From the occlusion sensitivity study, we showed that the models were able to identify the geometric features and have discovered that the model has learned to exploit the global information present in the SDF.
In contrast to the currently operational method, where design changes are restricted to the initially defined parameters, this study brings surrogate models one step closer to the long-term goal of having a model that can be used for approximate aerodynamic evaluation of unseen, arbitrary vehicle shapes, thereby providing complete design freedom to the vehicle stylists.

Submitted 12 August, 2021; originally announced August 2021.

arXiv:2108.04543 [pdf, other]

doi 10.1088/2516-1091/ac5b13

Known Operator Learning and Hybrid Machine Learning in Medical Imaging -- A Review of the Past, the Present, and the Future

Authors: Andreas Maier, Harald Köstler, Marco Heisig, Patrick Krauss, Seung Hee Yang

Abstract: In this article, we perform a review of the state-of-the-art of hybrid machine learning in medical imaging. We start with a short summary of the general developments of the past in machine learning and how general and specialized approaches have been in competition in the past decades. A particular focus will be the theoretical and experimental evidence pro and contra hybrid modelling. Next, we inspect several new developments regarding hybrid machine learning with a particular focus on so-called known operator learning and how hybrid approaches gain more and more momentum across essentially all applications in medical imaging and medical image analysis. As we will point out by numerous examples, hybrid models are taking over in image reconstruction and analysis. Even domains such as physical simulation and scanner and acquisition design are being addressed using machine learning grey box modelling approaches. Towards the end of the article, we will investigate a few future directions and point out relevant areas in which hybrid modelling, meta learning, and other domains will likely be able to drive the state-of-the-art ahead.

Submitted 10 August, 2021; originally announced August 2021.

Comments: 22 pages, 4 figures, submitted to "Progress in Biomedical Engineering"

Journal ref: Prog. Biomed. Eng. 4 022002 (2022)

arXiv:2009.07400 [pdf, other]

tinyMD: A Portable and Scalable Implementation for Pairwise Interactions Simulations

Authors: Rafael Ravedutti L. Machado, Jonas Schmitt, Sebastian Eibl, Jan Eitzinger, Roland Leißa, Sebastian Hack, Arsène Pérard-Gayot, Richard Membarth, Harald Köstler

Abstract: This paper investigates the suitability of the AnyDSL partial evaluation framework to implement tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application that scales very well on parallel supercomputers. We discuss the differences between both implementations and contrast miniMD's performance for single-node CPU and GPU targets, as well as its scalability on SuperMUC-NG and Piz Daint supercomputers. Additionally, we demonstrate tinyMD's flexibility by coupling it with the waLBerla multi-physics framework. This allows us to execute tinyMD simulations using the load-balancing mechanism implemented in waLBerla.

Submitted 15 September, 2020; originally announced September 2020.

Comments: 35 pages, 8 figures, submitted to Journal of Computational Science

MSC Class: B.8.2; D.1.3; D.3.3; J.2

arXiv:2006.09127 [pdf, other]

Quantum simulation and circuit design for solving multidimensional Poisson equations

Authors: Michael Holzmann, Harald Koestler

Abstract: Many methods solve Poisson equations by using grid techniques which discretize the problem in each dimension. Most of these algorithms are subject to the curse of dimensionality, so that they need exponential runtime. In the paper "Quantum algorithm and circuit design solving the Poisson equation" a quantum algorithm is shown running in polylog time to produce a quantum state representing the solution of the Poisson equation. In this paper a quantum simulation of an extended circuit design based on this algorithm is made on a classical computer. Our purpose is to test an efficient circuit design which can break the curse of dimensionality on a quantum computer. Due to the exponential rise of the Hilbert space this design is optimized on a small number of qubits. We use Microsoft's Quantum Development Kit and its simulator of an ideal quantum computer to validate the correctness of this algorithm.

Submitted 16 June, 2020; originally announced June 2020.

arXiv:2001.11806 [pdf, other]

lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods

Authors: Martin Bauer, Harald Köstler, Ulrich Rüde

Abstract: Lattice Boltzmann methods are a popular mesoscopic alternative to macroscopic computational fluid dynamics solvers. Many variants have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, or non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different methods and provides a generic development environment for new schemes as well. A high-level domain-specific language allows the user to formulate, extend and test various lattice Boltzmann schemes. The method specification is represented in a symbolic intermediate representation. Transformations that operate on this intermediate representation optimize and parallelize the method, yielding highly efficient lattice Boltzmann compute kernels not only for single- and two-relaxation-time schemes but also for multi-relaxation-time, cumulant, and entropically stabilized methods. An integration into the HPC framework waLBerla makes massively parallel, distributed simulations possible, which is demonstrated through scaling experiments on the SuperMUC-NG supercomputing system.

Submitted 11 April, 2020; v1 submitted 31 January, 2020; originally announced January 2020.

arXiv:1910.02749 [pdf, other]

Optimizing Geometric Multigrid Methods with Evolutionary Computation

Authors: Jonas Schmitt, Sebastian Kuckuk, Harald Köstler

Abstract: For many linear and nonlinear systems that arise from the discretization of partial differential equations the construction of an efficient multigrid solver is a challenging task. Here we present a novel approach for the optimization of geometric multigrid methods that is based on evolutionary computation, a generic program optimization technique inspired by the principle of natural evolution. A multigrid solver is represented as a tree of mathematical expressions which we generate based on a tailored grammar. The quality of each solver is evaluated in terms of convergence and compute performance using automated local Fourier analysis (LFA) and roofline performance modeling, respectively. Based on these objectives a multi-objective optimization is performed using strongly typed genetic programming with a non-dominated sorting based selection. To evaluate the model-based prediction and to target concrete applications, scalable implementations of an evolved solver can be automatically generated with the ExaStencils framework. We demonstrate our approach by constructing multigrid solvers for the steady-state heat equation with constant and variable coefficients that consistently perform better than common V- and W-cycles.

Submitted 8 October, 2019; v1 submitted 7 October, 2019; originally announced October 2019.

arXiv:1909.13772 [pdf, other]

doi 10.1016/j.camwa.2020.01.007

waLBerla: A block-structured high-performance framework for multiphysics simulations

Authors: Martin Bauer, Sebastian Eibl, Christian Godenschwager, Nils Kohl, Michael Kuron, Christoph Rettinger, Florian Schornbaum, Christoph Schwarzmeier, Dominik Thönnes, Harald Köstler, Ulrich Rüde

Abstract: Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building blocks for developing simulations on block-structured grids. The block-structured domain partitioning is flexible enough to handle complex geometries, while the structured grid within each block allows for highly efficient implementations of stencil-based algorithms. We present several example applications realized with waLBerla, ranging from lattice Boltzmann methods to rigid particle simulations. Most importantly, these methods can be coupled together, enabling multiphysics simulations. The framework uses meta-programming techniques to generate highly efficient code for CPUs and GPUs from a symbolic method formulation. To ensure software quality and performance portability, a continuous integration toolchain automatically runs an extensive test suite encompassing multiple compilers, hardware architectures, and software configurations.

Submitted 30 September, 2019; originally announced September 2019.

arXiv:1904.08684 [pdf, other]

Towards whole program generation of quadrature-free discontinuous Galerkin methods for the shallow water equations

Authors: Sara Faghih-Naini, Sebastian Kuckuk, Vadym Aizinger, Daniel Zint, Roberto Grosso, Harald Köstler

Abstract: The shallow water equations (SWE) are a commonly used model to study tsunamis, tides, and coastal ocean circulation. However, there exist various approaches to discretize and solve them efficiently. Which of them is best for a certain scenario is often not known and, in addition, depends heavily on the used HPC platform. From a simulation software perspective, this places a premium on the ability to adapt easily to different numerical methods and hardware architectures. One solution to this problem is to apply code generation techniques and to express methods and specific hardware-dependent implementations on different levels of abstraction. This allows for a separation of concerns and makes it possible, e.g., to exchange the discretization scheme without having to rewrite all low-level optimized routines manually. In this paper, we show how code for an advanced quadrature-free discontinuous Galerkin (DG) discretized shallow water equation solver can be generated. Here, we follow the multi-layered approach from the ExaStencils project that starts from the continuous problem formulation, moves to the discrete scheme, spells out the numerical algorithms, and, finally, maps to a representation that can be transformed to a distributed memory parallel implementation by our in-house Scala-based source-to-source compiler.
Our contributions include: A new quadrature-free discontinuous Galerkin formulation, an extension of the class of supported computational grids, and an extension of our toolchain allowing to evaluate discrete integrals stemming from the DG discretization implemented in Python. As first results we present the whole toolchain and also demonstrate the convergence of our method for higher order DG discretizations.

Submitted 18 April, 2019; originally announced April 2019.

arXiv:1711.11468 [pdf, other]

doi 10.1016/j.compfluid.2018.03.030

Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis

Authors: Markus Wittmann, Viktor Haag, Thomas Zeiser, Harald Köstler, Gerhard Wellein

Abstract: Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, but also on an efficient implementation on modern hardware. Hence, performance optimization has a long history in the lattice Boltzmann community. A variety of options exists regarding the implementation with direct impact on the solver performance. Experimenting and evaluating each option often is hard as the kernel itself is typically embedded in a larger code base. With our suite of lattice Boltzmann kernels we provide the infrastructure for such endeavors. Already included are several kernels ranging from simple to fully optimized implementations. Although these kernels are not fully functional CFD solvers, they are equipped with a solid verification method. The kernels may act as a reference for performance comparisons and as a blueprint for optimization strategies. In this paper we give an overview of already available kernels, establish a performance model for each kernel, and show a comparison of implementations and recent architectures.

Submitted 30 November, 2017; originally announced November 2017.

Comments: preprint, submitted to Computers & Fluids Special Issue DSFD2017

Journal ref: Computers & Fluids, 2018

arXiv:1708.08286 [pdf, other]

A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations

Authors: Nils Kohl, Johannes Hötzer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde

Abstract: Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to $40$ billion computational cells executing on more than $400$ billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up to more than $260\,000$ ($2^{18}$) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated in a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the material sciences and with a lattice Boltzmann method implementation.

Submitted 29 January, 2018; v1 submitted 28 August, 2017; originally announced August 2017.

arXiv:1511.07261 [pdf, other]

doi 10.1080/17445760.2015.1118478

A Python Extension for the Massively Parallel Multiphysics Simulation Framework waLBerla

Authors: Martin Bauer, Florian Schornbaum, Christian Godenschwager, Matthias Markl, Daniela Anderl, Harald Köstler, Ulrich Rüde

Abstract: We present a Python extension to the massively parallel HPC simulation toolkit waLBerla. waLBerla is a framework for stencil based algorithms operating on block-structured grids, with the main application field being fluid simulations in complex geometries using the lattice Boltzmann method. Careful performance engineering results in excellent node performance and good scalability to over 400,000 cores. To increase the usability and flexibility of the framework, a Python interface was developed. Python extensions are used at all stages of the simulation pipeline: They simplify and automate scenario setup, evaluation, and plotting. We show how our Python interface outperforms the existing text-file-based configuration mechanism, providing features like automatic nondimensionalization of physical quantities and handling of complex parameter dependencies. Furthermore, Python is used to process and evaluate results while the simulation is running, leading to smaller output files and the possibility to adjust parameters dependent on the current simulation state. C++ data structures are exported such that a seamless interfacing to other numerical Python libraries is possible. The expressive power of Python and the performance of C++ make development of efficient code with low time effort possible.

Submitted 23 November, 2015; originally announced November 2015.

arXiv:1506.01684 [pdf, other]

Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification

Authors: Martin Bauer, Johannes Hötzer, Philipp Steinmetz, Marcus Jainta, Marco Berghoff, Florian Schornbaum, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde

Abstract: Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute intensive due to a temperature dependent diffusive concentration. We significantly extend previous simulations that have used simpler phase-field models or were performed on smaller domain sizes. The new method has been implemented within the massively parallel HPC framework waLBerla that is designed to exploit current supercomputers efficiently. We apply various optimization techniques, including buffering techniques, explicit SIMD kernel vectorization, and communication hiding. Simulations utilizing up to 262,144 cores have been run on three different supercomputing architectures and weak scalability results are shown. Additionally, a hierarchical, mesh-based data reduction strategy is developed to keep the I/O problem manageable at scale.

Submitted 4 June, 2015; originally announced June 2015.

Comments: submitted to Supercomputing 2015

arXiv:1406.5369 [pdf, other]

A Scala Prototype to Generate Multigrid Solver Implementations for Different Problems and Target Multi-Core Platforms

Authors: Harald Koestler, Christian Schmitt, Sebastian Kuckuk, Frank Hannig, Juergen Teich, Ulrich Ruede

Abstract: Many problems in computational science and engineering involve partial differential equations and thus require the numerical solution of large, sparse (non)linear systems of equations. Multigrid is known to be one of the most efficient methods for this purpose. However, the concrete multigrid algorithm and its implementation highly depend on the underlying problem and hardware. Therefore, changes in the code or many different variants are necessary to cover all relevant cases. In this article we provide a prototype implementation in Scala for a framework that allows abstract descriptions of PDEs, their discretization, and their numerical solution via multigrid algorithms. From these, one is able to generate data structures and implementations of multigrid components required to solve elliptic PDEs on structured grids. Two different test problems showcase our proposed automatic generation of multigrid solvers for both CPU and GPU target platforms.

Submitted 20 June, 2014; originally announced June 2014.

arXiv:1112.0850 [pdf, ps, other]

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results

Authors: Johannes Habich, Christian Feichtinger, Harald Köstler, Georg Hager, Gerhard Wellein

Abstract: GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.

Submitted 5 December, 2011; originally announced December 2011.

Comments: 10 pages, 7 figures, 4 tables, preprint submitted to Computers and Fluids journal

arXiv:1007.1388 [pdf, ps, other]

doi 10.1016/j.parco.2011.03.005

A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters

Authors: Christian Feichtinger, Johannes Habich, Harald Koestler, Georg Hager, Ulrich Ruede, Gerhard Wellein

Abstract: Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios multi-GPUs make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented using clusters equipped with varying node configurations.

Submitted 8 July, 2010; originally announced July 2010.

Comments: 20 pages, 12 figures

Journal ref: Parallel Computing 37(9), 536-549 (2011)

Showing 1–33 of 33 results for author: Köstler, H