-
Robustness and uncertainty of direct numerical simulation under the influence of rounding and noise
Authors:
Martin Karp,
Niclas Jansson,
Saleh Rezaeiravesh,
Stefano Markidis,
Philipp Schlatter
Abstract:
Numerical precision in large-scale scientific computations has become an emerging topic due to recent developments in computer hardware. Lower floating point precision offers the potential for significant performance improvements, but the uncertainty added from reducing the numerical precision is a major obstacle for it to reach prevalence in high-fidelity simulations of turbulence. In the present…
▽ More
Numerical precision in large-scale scientific computations has become an emerging topic due to recent developments in computer hardware. Lower floating point precision offers the potential for significant performance improvements, but the uncertainty added from reducing the numerical precision is a major obstacle for it to reach prevalence in high-fidelity simulations of turbulence. In the present work, the impact of reducing the numerical precision under different rounding schemes is investigated and compared to the presence of white noise in the simulation data to obtain statistical averages of different quantities in the flow. To investigate how this impacts the simulation, an experimental methodology to assess the impact of these sources of uncertainty is proposed, in which each realization $u^i$ at time $t_i$ is perturbed, either by constraining the flow to a coarser discretization of the phase space (corresponding to low precision formats rounded with deterministic and stochastic rounding) or by perturbing the flow with white noise with a uniform distribution. The purpose of this approach is to assess the limiting factors for precision, and how robust a direct numerical simulation (DNS) is to noise and numerical precision. Our results indicate that for low-Re turbulent channel flow, stochastic rounding and noise impacts the results significantly less than deterministic rounding, indicating potential benefits of stochastic rounding over conventional round-to-nearest. We find that to capture the probability density function of the velocity change in time, the floating point precision is especially important in regions with small relative velocity changes and low turbulence intensity, but less important in regions with large velocity gradients and variations such as in the near-wall region.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Enabling Mixed-Precision in Computational Fluids Dynamics Codes
Authors:
Yanxiang Chen,
Pablo de Oliveira Castro,
Paolo Bientinesi,
Niclas Jansson,
Roman Iakymchuk
Abstract:
Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-applicat…
▽ More
Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000, and a modern Neko CFD application. With the help of the VerifiCarlo tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 38% and energy-to-solution by 2.8x on MareNostrum 5, while in the real-world Neko application the gain is up to 29% in time and up to 24% in energy, without sacrificing the accuracy.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Port-Hamiltonian Neural Networks with Output Error Noise Models
Authors:
Sarvin Moradi,
Gerben I. Beintema,
Nick Jaensson,
Roland Tóth,
Maarten Schoukens
Abstract:
Hamiltonian neural networks (HNNs) represent a promising class of physics-informed deep learning methods that utilize Hamiltonian theory as foundational knowledge within neural networks. However, their direct application to engineering systems is often challenged by practical issues, including the presence of external inputs, dissipation, and noisy measurements. This paper introduces a novel frame…
▽ More
Hamiltonian neural networks (HNNs) represent a promising class of physics-informed deep learning methods that utilize Hamiltonian theory as foundational knowledge within neural networks. However, their direct application to engineering systems is often challenged by practical issues, including the presence of external inputs, dissipation, and noisy measurements. This paper introduces a novel framework that enhances the capabilities of HNNs to address these real-life factors. We integrate port-Hamiltonian theory into the neural network structure, allowing for the inclusion of external inputs and dissipation, while mitigating the impact of measurement noise through an output-error (OE) model structure. The resulting output error port-Hamiltonian neural networks (OE-pHNNs) can be adapted to tackle modeling complex engineering systems with noisy measurements. Furthermore, we propose the identification of OE-pHNNs based on the subspace encoder approach (SUBNET), which efficiently approximates the complete simulation loss using subsections of the data and uses an encoder function to predict initial states. By integrating SUBNET with OE-pHNNs, we achieve consistent models of complex engineering systems under noisy measurements. In addition, we perform a consistency analysis to ensure the reliability of the proposed data-driven model learning method. We demonstrate the effectiveness of our approach on system identification benchmarks, showing its potential as a powerful tool for modeling dynamic systems in real-world applications.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Learning Subsystem Dynamics in Nonlinear Systems via Port-Hamiltonian Neural Networks
Authors:
G. J. E. van Otterdijk,
S. Moradi,
S. Weiland,
R. Tóth,
N. O. Jaensson,
M. Schoukens
Abstract:
Port-Hamiltonian neural networks (pHNNs) are emerging as a powerful modeling tool that integrates physical laws with deep learning techniques. While most research has focused on modeling the entire dynamics of interconnected systems, the potential for identifying and modeling individual subsystems while operating as part of a larger system has been overlooked. This study addresses this gap by intr…
▽ More
Port-Hamiltonian neural networks (pHNNs) are emerging as a powerful modeling tool that integrates physical laws with deep learning techniques. While most research has focused on modeling the entire dynamics of interconnected systems, the potential for identifying and modeling individual subsystems while operating as part of a larger system has been overlooked. This study addresses this gap by introducing a novel method for using pHNNs to identify such subsystems based solely on input-output measurements. By utilizing the inherent compositional property of the port-Hamiltonian systems, we developed an algorithm that learns the dynamics of individual subsystems, without requiring direct access to their internal states. On top of that, by choosing an output error (OE) model structure, we have been able to handle measurement noise effectively. The effectiveness of the proposed approach is demonstrated through tests on interconnected systems, including multi-physics scenarios, demonstrating its potential for identifying subsystem dynamics and facilitating their integration into new interconnected models.
△ Less
Submitted 8 November, 2024;
originally announced November 2024.
-
In-Situ Techniques on GPU-Accelerated Data-Intensive Applications
Authors:
Yi Ju,
Mingshuai Li,
Adalberto Perez,
Laura Bellentani,
Niclas Jansson,
Stefano Markidis,
Philipp Schlatter,
Erwin Laure
Abstract:
The computational power of High-Performance Computing (HPC) systems is constantly increasing, however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This unbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visua…
▽ More
The computational power of High-Performance Computing (HPC) systems is constantly increasing, however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This unbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amount of data. Thus, restricted IO performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without IO). In-situ techniques, where data is further processed while still in memory rather to write it out over the I/O subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the IO subsystem. They offer a promising approach for applications aiming to leverage the full power of large scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On one node, the GPUs would have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks using CPUs to perform data analysis or preprocess data concurrently to the running simulation, offer a possibility to improve this underutilization.
△ Less
Submitted 30 July, 2024;
originally announced July 2024.
-
Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures
Authors:
Martin Karp,
Estela Suarez,
Jan H. Meinke,
Måns I. Andersson,
Philipp Schlatter,
Stefano Markidis,
Niclas Jansson
Abstract:
The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high…
▽ More
The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the modular supercomputing architecture (MSA). We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Supercomputers as a Continous Medium
Authors:
Martin Karp,
Niclas Jansson,
Philipp Schlatter,
Stefano Markidis
Abstract:
As supercomputers' complexity has grown, the traditional boundaries between processor, memory, network, and accelerators have blurred, making a homogeneous computer model, in which the overall computer system is modeled as a continuous medium with homogeneously distributed computational power, memory, and data movement transfer capabilities, an intriguing and powerful abstraction. By applying a ho…
▽ More
As supercomputers' complexity has grown, the traditional boundaries between processor, memory, network, and accelerators have blurred, making a homogeneous computer model, in which the overall computer system is modeled as a continuous medium with homogeneously distributed computational power, memory, and data movement transfer capabilities, an intriguing and powerful abstraction. By applying a homogeneous computer model to algorithms with a given I/O complexity, we recover from first principles, other discrete computer models, such as the roofline model, parallel computing laws, such as Amdahl's and Gustafson's laws, and phenomenological observations, such as super-linear speedup. One of the homogeneous computer model's distinctive advantages is the capability of directly linking the performance limits of an application to the physical properties of a classical computer system. Applying the homogeneous computer model to supercomputers, such as Frontier, Fugaku, and the Nvidia DGX GH200, shows that applications, such as Conjugate Gradient (CG) and Fast Fourier Transforms (FFT), are rapidly approaching the fundamental classical computational limits, where the performance of even denser systems in terms of compute and memory are fundamentally limited by the speed of light.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
iFast: Host-Side Logging for Scientific Applications
Authors:
Steven W. D. Chien,
Kento Sato,
Artur Podobas,
Niclas Jansson,
Stefano Markidis,
Michio Honda
Abstract:
We have seen an increase in the heterogeneity of storage technologies potentially available to scientific applications, such as burst buffers, managed cloud parallel file systems (PFS), and object stores. However, those applications cannot easily utilize those technologies, because they are designed for traditional HPC systems that offer very high remote storage and network bandwidth. We present i…
▽ More
We have seen an increase in the heterogeneity of storage technologies potentially available to scientific applications, such as burst buffers, managed cloud parallel file systems (PFS), and object stores. However, those applications cannot easily utilize those technologies, because they are designed for traditional HPC systems that offer very high remote storage and network bandwidth. We present iFast, a new distributed host-side logging approach to transparently accelerating scientific applications. iFast has a strong emphasis on deployability, supporting unmodified MPI applications with unmodified MPI implementations while preserving the crash consistency semantics. We evaluate iFast on traditional HPC, cloud HPC, local cluster, and a hybrid of both, using three scientific applications. iFast reduces end-to-end execution time by 13-26% for popular scientific applications on the cloud. We show for the first time, how an application on a recent production HPC system can write data to S3 storage through fully fledged MPI-IO, in a readily shareable format.
△ Less
Submitted 2 August, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
Physics-Informed Learning Using Hamiltonian Neural Networks with Output Error Noise Models
Authors:
Sarvin Moradi,
Nick Jaensson,
Roland Tóth,
Maarten Schoukens
Abstract:
In order to make data-driven models of physical systems interpretable and reliable, it is essential to include prior physical knowledge in the modeling framework. Hamiltonian Neural Networks (HNNs) implement Hamiltonian theory in deep learning and form a comprehensive framework for modeling autonomous energy-conservative systems. Despite being suitable to estimate a wide range of physical system b…
▽ More
In order to make data-driven models of physical systems interpretable and reliable, it is essential to include prior physical knowledge in the modeling framework. Hamiltonian Neural Networks (HNNs) implement Hamiltonian theory in deep learning and form a comprehensive framework for modeling autonomous energy-conservative systems. Despite being suitable to estimate a wide range of physical system behavior from data, classical HNNs are restricted to systems without inputs and require noiseless state measurements and information on the derivative of the state to be available. To address these challenges, this paper introduces an Output Error Hamiltonian Neural Network (OE-HNN) modeling approach to address the modeling of physical systems with inputs and noisy state measurements. Furthermore, it does not require the state derivatives to be known. Instead, the OE-HNN utilizes an ODE-solver embedded in the training process, which enables the OE-HNN to learn the dynamics from noisy state measurements. In addition, extending HNNs based on the generalized Hamiltonian theory enables to include external inputs into the framework which are important for engineering applications. We demonstrate via simulation examples that the proposed OE-HNNs results in superior modeling performance compared to classical HNNs.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Large-Scale Direct Numerical Simulations of Turbulence Using GPUs and Modern Fortran
Authors:
Martin Karp,
Daniele Massaro,
Niclas Jansson,
Alistair Hart,
Jacob Wahlgren,
Philipp Schlatter,
Stefano Markidis
Abstract:
We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by…
▽ More
We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world's first direct numerical simulation of the flow around a Flettner rotor at Re=30'000 and its interaction with a turbulent boundary layer. We present one of the first performance comparisons between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency.
△ Less
Submitted 23 June, 2022;
originally announced July 2022.
-
Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems
Authors:
Jonathan Vincent,
Jing Gong,
Martin Karp,
Adam Peplinski,
Niclas Jansson,
Artur Podobas,
Andreas Jocksch,
Jie Yao,
Fazle Hussain,
Stefano Markidis,
Matts Karlsson,
Dirk Pleiter,
Erwin Laure,
Philipp Schlatter
Abstract:
We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers $Re_τ=360$ and $Re_τ=550$, based on friction velocity and pipe radius. The strong…
▽ More
We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers $Re_τ=360$ and $Re_τ=550$, based on friction velocity and pipe radius. The strong scaling is tested on several GPU-enabled HPC systems, including the Swiss Piz Daint system, TACC's Longhorn, Jülich's JUWELS Booster, and Berzelius in Sweden. The performance results show that speed-up between 3-5 can be achieved using the GPU accelerated version compared with the CPU version on these different systems. The run-time for 20 timesteps reduces from 43.5 to 13.2 seconds with increasing the number of GPUs from 64 to 512 for $Re_τ=550$ case on JUWELS Booster system. This illustrates the GPU accelerated version the potential for high throughput. At the same time, the strong scaling limit is significantly larger for GPUs, at about $2000-5000$ elements per rank; compared to about $50-100$ for a CPU-rank.
△ Less
Submitted 4 November, 2021; v1 submitted 8 September, 2021;
originally announced September 2021.
-
A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays
Authors:
Martin Karp,
Artur Podobas,
Tobias Kenter,
Niclas Jansson,
Christian Plessl,
Philipp Schlatter,
Stefano Markidis
Abstract:
The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand.
In this work, we de…
▽ More
The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand.
In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work -- which often focuses on accelerating small kernels -- we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ Gflop/s) single-precision performance and an upwards of 2x reduction in runtime for the local evaluation of the Laplace operator.
We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies.
△ Less
Submitted 2 November, 2021; v1 submitted 27 August, 2021;
originally announced August 2021.
-
Neko: A Modern, Portable, and Scalable Framework for High-Fidelity Computational Fluid Dynamics
Authors:
Niclas Jansson,
Martin Karp,
Artur Podobas,
Stefano Markidis,
Philipp Schlatter
Abstract:
Recent trends and advancement in including more diverse and heterogeneous hardware in High-Performance Computing is challenging software developers in their pursuit for good performance and numerical stability. The well-known maxim "software outlives hardware" may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. C…
▽ More
Recent trends and advancement in including more diverse and heterogeneous hardware in High-Performance Computing is challenging software developers in their pursuit for good performance and numerical stability. The well-known maxim "software outlives hardware" may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. CFD is one of the many application domains affected. In this paper, we present Neko, a portable framework for high-order spectral element flow simulations. Unlike prior works, Neko adopts a modern object-oriented approach, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors down to exotic vector processors and FPGAs. We show that Neko's performance and accuracy are comparable to NekRS, and thus on-par with Nek5000's successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers
Authors:
Martin Svedin,
Steven W. D. Chien,
Gibson Chikafa,
Niclas Jansson,
Artur Podobas
Abstract:
For many, Graphics Processing Units (GPUs) provides a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non…
▽ More
For many, Graphics Processing Units (GPUs) provides a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with particular focus on empirically quantifying our derived performance expectations, and -- should those expectations be undelivered -- investigate whether the introduced data-movement features can offset any eventual loss in performance? We find that the A100 delivers less performance increase than previous generations for the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark and demonstrate where (and more importantly, how) they should be used.
△ Less
Submitted 3 July, 2021; v1 submitted 9 June, 2021;
originally announced June 2021.
-
Accelerating Radiation Therapy Dose Calculation with Nvidia GPUs
Authors:
Felix Liu,
Niclas Jansson,
Artur Podobas,
Albin Fredriksson,
Stefano Markidis
Abstract:
Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The stat…
▽ More
Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The state-of-the-art in medical facilities uses general-purpose processors (CPUs) to perform many steps in the RTP process. In this paper, we explore the use of accelerators to reduce RTP calculating time. We focus on the step that calculates the dose using the Graphics Processing Unit (GPU), which we believe is an excellent candidate for this computation type. Next, we create a highly optimized implementation for a custom Sparse Matrix-Vector Multiplication (SpMV) that operates on numerical formats unavailable in state-of-the-art SpMV libraries (e.g., Ginkgo and cuSPARSE). We show that our implementation is several times faster than the baseline (up-to 4x) and has a higher operational intensity than similar (but different) versions such as Ginkgo and cuSPARSE.
△ Less
Submitted 19 September, 2021; v1 submitted 17 March, 2021;
originally announced March 2021.
-
High-Performance Spectral Element Methods on Field-Programmable Gate Arrays
Authors:
Martin Karp,
Artur Podobas,
Niclas Jansson,
Tobias Kenter,
Christian Plessl,
Philipp Schlatter,
Stefano Markidis
Abstract:
Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures' comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate…
▽ More
Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures' comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study modern FPGAs' applicability in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double-precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project future FPGAs' performance and role to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?
△ Less
Submitted 4 May, 2021; v1 submitted 26 October, 2020;
originally announced October 2020.
-
Optimization of Tensor-product Operations in Nekbone on GPUs
Authors:
Martin Karp,
Niclas Jansson,
Artur Podobas,
Philipp Schlatter,
Stefano Markidis
Abstract:
In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to…
▽ More
In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77 - 92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024 - 4096 elements and polynomial degree 9.
△ Less
Submitted 27 May, 2020;
originally announced May 2020.
-
CUBE: A scalable framework for large-scale industrial simulations
Authors:
Niclas Jansson,
Rahul Bale,
Keiji Onishi,
Makoto Tsubokura
Abstract:
Writing high performance solvers for engineering applications is a delicate task. These codes are often developed on an application to application basis, highly optimized to solve a certain problem. Here, we present our work on developing a general simulation framework for efficient computation of time resolved approximations of complex industrial flow problems - Complex Unified Building cubE meth…
▽ More
Writing high performance solvers for engineering applications is a delicate task. These codes are often developed on an application to application basis, highly optimized to solve a certain problem. Here, we present our work on developing a general simulation framework for efficient computation of time resolved approximations of complex industrial flow problems - Complex Unified Building cubE method (Cube). To address the challenges of emerging, modern supercomputers, suitable data structures and communication patterns are developed and incorporated into Cube. We use a Cartesian grid together with various immersed boundary methods to accurately capture moving, complex geometries. The asymmetric workload of the immersed boundary is balanced by a predictive dynamic load balancer, and a multithreaded halo-exchange algorithm is employed to efficiently overlap communication with computations. Our work also concerns efficient methods for handling the large amount of data produced by large-scale flow simulations, such as scalable parallel I/O, data compression and in-situ processing.
△ Less
Submitted 13 August, 2018;
originally announced August 2018.