Search | arXiv e-print repository

arXiv:2502.20072 [pdf, other]

A high-performance and portable implementation of the SISSO method for CPUs and GPUs

Authors: Sebastian Eibl, Yi Yao, Matthias Scheffler, Markus Rampp, Luca M. Ghiringhelli, Thomas A. R. Purcell

Abstract: SISSO (sure-independence screening and sparsifying operator) is an artificial intelligence (AI) method based on symbolic regression and compressed sensing widely used in materials science research. SISSO++ is its C++ implementation that employs MPI and OpenMP for parallelization, rendering it well-suited for high-performance computing (HPC) environments. As heterogeneous hardware becomes mainstream in the HPC and AI fields, we chose to port the SISSO++ code to GPUs using the Kokkos performance-portable library. Kokkos allows us to maintain a single codebase for both Nvidia and AMD GPUs, significantly reducing the maintenance effort. In this work, we summarize the necessary code changes we did to achieve hardware and performance portability. This is accompanied by performance benchmarks on Nvidia and AMD GPUs. We demonstrate the speedups obtained from using GPUs across the three most time-consuming parts of our code.

Submitted 27 February, 2025; originally announced February 2025.

arXiv:2502.00049 [pdf, other]

doi 10.18420/giljt2025_10

FORTE: An Open-Source System for Cost-Effective and Scalable Environmental Monitoring

Authors: Zoe Pfister, Michael Vierhauser, Alzbeta Medvedova, Marie Schroeder, Markus Rampp, Adrian Kronenberg, Albin Hammerle, Georg Wohlfahrt, Alexandra Jäger, Ruth Breu, Alois Simon

Abstract: Forests are an essential part of our biosphere, regulating climate, acting as a sink for greenhouse gases, and providing numerous other ecosystem services. However, they are negatively impacted by climatic stressors such as drought or heat waves. In this paper, we introduce FORTE, an open-source system for environmental monitoring with the aim of understanding how forests react to such stressors. It consists of two key components: (1) a wireless sensor network (WSN) deployed in the forest for data collection, and (2) a Data Infrastructure for data processing, storage, and visualization. The WSN contains a Central Unit capable of transmitting data to the Data Infrastructure via LTE-M and several spatially independent Satellites that collect data over large areas and transmit them wirelessly to the Central Unit. Our prototype deployments show that our solution is cost-effective compared to commercial solutions, energy-efficient with sensor nodes lasting for several months on a single charge, and reliable in terms of data quality. FORTE's flexible architecture makes it suitable for a wide range of environmental monitoring applications beyond forest monitoring. The contributions of this paper are three-fold. First, we describe the high-level requirements necessary for developing an environmental monitoring system. Second, we present an architecture and prototype implementation of the requirements by introducing our FORTE platform and demonstrating its effectiveness through multiple field tests. Lastly, we provide source code, documentation, and hardware design artifacts as part of our open-source repository.

Submitted 28 January, 2025; originally announced February 2025.

ACM Class: D.2.1; D.2.11

arXiv:2411.05009 [pdf, other]

A Study of Performance Portability in Plasma Physics Simulations

Authors: Josef Ruzicka, Christian Asch, Esteban Meneses, Markus Rampp, Erwin Laure

Abstract: The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers back up the relentless progress of those architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two relevant applications in computational plasma physics with applications to magnetically-confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and decent "out-of-the-box" performance, even for the very latest hardware platforms. For our applications, Kokkos provided performance portability to the broadest range of hardware architectures from different vendors.

Submitted 18 October, 2024; originally announced November 2024.

Comments: 15 pages, 8 figures, this is a pre-print to be published in the Latin America High Performance Computing Conference (CARLA) 2024 proceedings

MSC Class: 68Q85 ACM Class: D.1.3

arXiv:2109.10876 [pdf, other]

doi 10.1016/j.cpc.2023.108760

Code modernization strategies for short-range non-bonded molecular dynamics simulations

Authors: James Vance, Zhen-Hao Xu, Nikita Tretyakov, Torsten Stuehn, Markus Rampp, Sebastian Eibl, Christoph Junghans, André Brinkmann

Abstract: Modern HPC systems are increasingly relying on greater core counts and wider vector registers. Thus, applications need to be adapted to fully utilize these hardware capabilities. One class of applications that can benefit from this increase in parallelism are molecular dynamics simulations. In this paper, we describe our efforts at modernizing the ESPResSo++ molecular dynamics simulation package by restructuring its particle data layout for efficient memory accesses and applying vectorization techniques to benefit the calculation of short-range non-bonded forces, which results in an overall three times speedup and serves as a baseline for further optimizations. We also implement fine-grained parallelism for multi-core CPUs through HPX, a C++ runtime system which uses lightweight threads and an asynchronous many-task approach to maximize concurrency. Our goal is to evaluate the performance of an HPX-based approach compared to the bulk-synchronous MPI-based implementation. This requires the introduction of an additional layer to the domain decomposition scheme that defines the task granularity. On spatially inhomogeneous systems, which impose a corresponding load-imbalance in traditional MPI-based approaches, we demonstrate that by choosing an optimal task size, the efficient work-stealing mechanisms of HPX can overcome the overhead of communication resulting in an overall 1.4 times speedup compared to the baseline MPI version.

Submitted 15 June, 2023; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: 42 pages, 9 figures, SI

Journal ref: Comp. Phys. Comm. 290, 108760 (2023)

arXiv:2107.01104 [pdf, other]

doi 10.1016/j.cpc.2022.108406

An Efficient Particle Tracking Algorithm for Large-Scale Parallel Pseudo-Spectral Simulations of Turbulence

Authors: Cristian C. Lalescu, Bérenger Bramas, Markus Rampp, Michael Wilczek

Abstract: Particle tracking in large-scale numerical simulations of turbulent flows presents one of the major bottlenecks in parallel performance and scaling efficiency. Here, we describe a particle tracking algorithm for large-scale parallel pseudo-spectral simulations of turbulence which scales well up to billions of tracer particles on modern high-performance computing architectures. We summarize the standard parallel methods used to solve the fluid equations in our hybrid MPI/OpenMP implementation. As the main focus, we describe the implementation of the particle tracking algorithm and document its computational performance. To address the extensive inter-process communication required by particle tracking, we introduce a task-based approach to overlap point-to-point communications with computations, thereby enabling improved resource utilization. We characterize the computational cost as a function of the number of particles tracked and compare it with the flow field computation, showing that the cost of particle tracking is very small for typical applications.

Submitted 30 May, 2022; v1 submitted 2 July, 2021; originally announced July 2021.

arXiv:1911.08394 [pdf, ps, other]

Evaluation of performance portability frameworks for the implementation of a particle-in-cell code

Authors: Victor Artigues, Katharina Kormann, Markus Rampp, Klaus Reuter

Abstract: This paper reports on an in-depth evaluation of the performance portability frameworks Kokkos and RAJA with respect to their suitability for the implementation of complex particle-in-cell (PIC) simulation codes, extending previous studies based on codes from other domains. At the example of a particle-in-cell model, we implemented the hotspot of the code in C++ and parallelized it using OpenMP, OpenACC, CUDA, Kokkos, and RAJA, targeting multi-core (CPU) and graphics (GPU) processors. Both, Kokkos and RAJA appear mature, are usable for complex codes, and keep their promise to provide performance portability across different architectures. Comparing the obtainable performance on state-of-the art hardware, but also considering aspects such as code complexity, feature availability, and overall productivity, we finally draw the conclusion that the Kokkos framework would be suited best to tackle the massively parallel implementation of the full PIC model.

Submitted 19 November, 2019; originally announced November 2019.

arXiv:1903.00308 [pdf, ps, other]

doi 10.1177/1094342019834644

A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov-Poisson equation

Authors: Katharina Kormann, Klaus Reuter, Markus Rampp

Abstract: This paper presents an optimized and scalable semi-Lagrangian solver for the Vlasov-Poisson system in six-dimensional phase space. Grid-based solvers of the Vlasov equation are known to give accurate results. At the same time, these solvers are challenged by the curse of dimensionality resulting in very high memory requirements, and moreover, requiring highly efficient parallelization schemes. In this paper, we consider the 6d Vlasov-Poisson problem discretized by a split-step semi-Lagrangian scheme, using successive 1d interpolations on 1d stripes of the 6d domain. Two parallelization paradigms are compared, a remapping scheme and a classical domain decomposition approach applied to the full 6d problem. From numerical experiments, the latter approach is found to be superior in the massively parallel case in various respects. We address the challenge of artificial time step restrictions due to the decomposition of the domain by introducing a blocked one-sided communication scheme for the purely electrostatic case and a rotating mesh for the case with a constant magnetic field. In addition, we propose a pipelining scheme that enables to hide the costs for the halo communication between neighbor processes efficiently behind useful computation. Parallel scalability on up to 65k processes is demonstrated for benchmark problems on a supercomputer.

Submitted 1 March, 2019; originally announced March 2019.

arXiv:1511.04203 [pdf, ps, other]

doi 10.13182/FST15-154

GPEC, a real-time capable Tokamak equilibrium code

Authors: Markus Rampp, Roland Preuss, Rainer Fischer, the ASDEX Upgrade Team

Abstract: A new parallel equilibrium reconstruction code for tokamak plasmas is presented. GPEC allows to compute equilibrium flux distributions sufficiently accurate to derive parameters for plasma control within 1 ms of runtime which enables real-time applications at the ASDEX Upgrade experiment (AUG) and other machines with a control cycle of at least this size. The underlying algorithms are based on the well-established offline-analysis code CLISTE, following the classical concept of iteratively solving the Grad-Shafranov equation and feeding in diagnostic signals from the experiment. The new code adopts a hybrid parallelization scheme for computing the equilibrium flux distribution and extends the fast, shared-memory-parallel Poisson solver which we have described previously by a distributed computation of the individual Poisson problems corresponding to different basis functions. The code is based entirely on open-source software components and runs on standard server hardware and software environments. The real-time capability of GPEC is demonstrated by performing an offline-computation of a sequence of 1000 flux distributions which are taken from one second of operation of a typical AUG discharge and deriving the relevant control parameters with a time resolution of a millisecond. On current server hardware the new code allows employing a grid size of 32x64 zones for the spatial discretization and up to 15 basis functions. It takes into account about 90 diagnostic signals while using up to 4 equilibrium iterations and computing more than 20 plasma-control parameters, including the computationally expensive safety-factor q on at least 4 different levels of the normalized flux.

Submitted 25 May, 2016; v1 submitted 13 November, 2015; originally announced November 2015.

Comments: minor typos corrected and reference updated, matches published version

Journal ref: Fusion Science and Technology 70(1), 2016, 1-13

arXiv:1310.1485 [pdf, other]

Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX

Authors: Tilman Dannert, Andreas Marek, Markus Rampp

Abstract: We have developed GPU versions for two major high-performance-computing (HPC) applications originating from two different scientific domains. GENE is a plasma microturbulence code which is employed for simulations of nuclear fusion plasmas. VERTEX is a neutrino-radiation hydrodynamics code for "first principles"-simulations of core-collapse supernova explosions. The codes are considered state of the art in their respective scientific domains, both concerning their scientific scope and functionality as well as the achievable compute performance, in particular parallel scalability on all relevant HPC platforms. GENE and VERTEX were ported by us to HPC cluster architectures with two NVidia Kepler GPUs mounted in each node in addition to two Intel Xeon CPUs of the Sandy Bridge family. On such platforms we achieve up to twofold gains in the overall application performance in the sense of a reduction of the time to solution for a given setup with respect to a pure CPU cluster. The paper describes our basic porting strategies and benchmarking methodology, and details the main algorithmic and technical challenges we faced on the new, heterogeneous architecture.

Submitted 5 October, 2013; originally announced October 2013.

Comments: 10 pages, accepted for publication in ParCo 2013

Showing 1–9 of 9 results for author: Rampp, M