-
Energy-aware operation of HPC systems in Germany
Authors:
Estela Suarez,
Hendryk Bockelmann,
Norbert Eicker,
Jan Eitzinger,
Salem El Sayed,
Thomas Fieseler,
Martin Frank,
Peter Frech,
Pay Giesselmann,
Daniel Hackenberg,
Georg Hager,
Andreas Herten,
Thomas Ilsche,
Bastian Koller,
Erwin Laure,
Cristina Manzano,
Sebastian Oeste,
Michael Ott,
Klaus Reuter,
Ralf Schneider,
Kay Thust,
Benedikt von St. Vieth
Abstract:
High-Performance Computing (HPC) systems are among the most energy-intensive scientific facilities, with electric power consumption reaching and often exceeding 20 megawatts per installation. Unlike other major scientific infrastructures such as particle accelerators or high-intensity light sources, which are few around the world, the number and size of supercomputers are continuously increasing. Even if every new system generation is more energy efficient than the previous one, the overall growth in size of the HPC infrastructure, driven by a rising demand for computational capacity across all scientific disciplines, and especially by artificial intelligence (AI) workloads, rapidly drives up the energy demand. This challenge is particularly significant for HPC centers in Germany, where high electricity costs, stringent national energy policies, and a strong commitment to environmental sustainability are key factors. This paper describes various state-of-the-art strategies and innovations employed to enhance the energy efficiency of HPC systems within the national context. Case studies from leading German HPC facilities illustrate the implementation of novel heterogeneous hardware architectures, advanced monitoring infrastructures, high-temperature cooling solutions, energy-aware scheduling, and dynamic power management, among other optimizations. By reviewing best practices and ongoing research, this paper aims to share valuable insight with the global HPC community, motivating the pursuit of more sustainable and energy-efficient HPC operations.
Submitted 25 November, 2024;
originally announced November 2024.
-
PolypNextLSTM: A lightweight and fast polyp video segmentation network using ConvNext and ConvLSTM
Authors:
Debayan Bhattacharya,
Konrad Reuter,
Finn Behrendt,
Lennart Maack,
Sarah Grube,
Alexander Schlaefer
Abstract:
Commonly employed in polyp segmentation, single-image UNet architectures lack the temporal insight clinicians gain from video data in diagnosing polyps. To mirror clinical practices more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it possibly suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, strategically omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short Term Memory (ConvLSTM), effectively exploits temporal features. Our primary novelty lies in PolypNextLSTM, which stands out as the leanest in parameters and the fastest model, surpassing the performance of five state-of-the-art image and video-based deep learning models. The evaluation on the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts like fast motion and occlusion. Comparison against 5 image-based and 5 video-based models demonstrates PolypNextLSTM's superiority, achieving a Dice score of 0.7898 on the hard-to-detect polyp test set, surpassing image-based PraNet (0.7519) and video-based PNSPlusNet (0.7486). Notably, our model excels in videos featuring complex artefacts such as ghosting and occlusion. PolypNextLSTM, integrating pruned ConvNext-Tiny with ConvLSTM for temporal fusion, not only exhibits superior segmentation performance but also maintains the highest frames per second among evaluated models. The code is available at https://github.com/mtec-tuhh/PolypNextLSTM
Submitted 28 February, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
Evaluation of performance portability frameworks for the implementation of a particle-in-cell code
Authors:
Victor Artigues,
Katharina Kormann,
Markus Rampp,
Klaus Reuter
Abstract:
This paper reports on an in-depth evaluation of the performance portability frameworks Kokkos and RAJA with respect to their suitability for the implementation of complex particle-in-cell (PIC) simulation codes, extending previous studies based on codes from other domains. Using a particle-in-cell model as an example, we implemented the hotspot of the code in C++ and parallelized it using OpenMP, OpenACC, CUDA, Kokkos, and RAJA, targeting multi-core (CPU) and graphics (GPU) processors. Both Kokkos and RAJA appear mature, are usable for complex codes, and keep their promise to provide performance portability across different architectures. Comparing the obtainable performance on state-of-the-art hardware, but also considering aspects such as code complexity, feature availability, and overall productivity, we finally draw the conclusion that the Kokkos framework would be best suited to tackle the massively parallel implementation of the full PIC model.
Submitted 19 November, 2019;
originally announced November 2019.
-
MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis
Authors:
Luka Stanisic,
Klaus Reuter
Abstract:
This paper reports on the design and implementation of the HPC performance monitoring system deployed to continuously monitor performance metrics of all jobs on the HPC systems at the Max Planck Computing and Data Facility (MPCDF). The system thereby reveals important information to various stakeholders, in particular to users, application support, system administrators, and management. On each compute node, hardware and software performance monitoring data is collected by our newly developed lightweight open-source hpcmd middleware, which builds upon standard Linux tools. The data is transported via rsyslog, and aggregated and processed by a Splunk system, enabling detailed per-cluster and per-job interactive analysis in a web browser. Additionally, performance reports are provided to the users as PDF files. Finally, we report on practical experience and benefits from large-scale deployments on MPCDF HPC systems, demonstrating how our solution can be useful to any HPC center.
Submitted 25 September, 2019;
originally announced September 2019.
-
A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov-Poisson equation
Authors:
Katharina Kormann,
Klaus Reuter,
Markus Rampp
Abstract:
This paper presents an optimized and scalable semi-Lagrangian solver for the Vlasov-Poisson system in six-dimensional phase space. Grid-based solvers of the Vlasov equation are known to give accurate results. At the same time, these solvers are challenged by the curse of dimensionality, which results in very high memory requirements and, moreover, requires highly efficient parallelization schemes. In this paper, we consider the 6d Vlasov-Poisson problem discretized by a split-step semi-Lagrangian scheme, using successive 1d interpolations on 1d stripes of the 6d domain. Two parallelization paradigms are compared, a remapping scheme and a classical domain decomposition approach applied to the full 6d problem. From numerical experiments, the latter approach is found to be superior in the massively parallel case in various respects. We address the challenge of artificial time step restrictions due to the decomposition of the domain by introducing a blocked one-sided communication scheme for the purely electrostatic case and a rotating mesh for the case with a constant magnetic field. In addition, we propose a pipelining scheme that makes it possible to hide the cost of the halo communication between neighboring processes efficiently behind useful computation. Parallel scalability on up to 65k processes is demonstrated for benchmark problems on a supercomputer.
Submitted 1 March, 2019;
originally announced March 2019.
-
Optimizations of the Eigensolvers in the ELPA Library
Authors:
P. Kus,
A. Marek,
S. S. Koecher,
H. -H. Kowalski,
C. Carbogno,
Ch. Scheurer,
K. Reuter,
M. Scheffler,
H. Lederer
Abstract:
The solution of (generalized) eigenvalue problems for symmetric or Hermitian matrices is a common subtask of many numerical calculations in electronic structure theory or materials science. Solving the eigenvalue problem can easily amount to a sizeable fraction of the whole numerical calculation. For researchers in the field of computational materials science, an efficient and scalable solution of the eigenvalue problem is thus of major importance. The ELPA-library is a well-established dense direct eigenvalue solver library, which has proven to be very efficient and scalable up to very large core counts. In this paper, we describe the latest optimizations of the ELPA-library for new HPC architectures of the Intel Skylake processor family with an AVX-512 SIMD instruction set, or for HPC systems accelerated with recent GPUs. We also describe a complete redesign of the API in a modern modular way, which, apart from a much simpler and more flexible usability, leads to a new path to access system-specific performance optimizations. In order to ensure optimal performance for a particular scientific setting or a specific HPC system, the new API allows the user to influence, in a straightforward way, the internal details of the algorithms and the performance-critical parameters used in the ELPA-library. On top of that, we introduced an autotuning functionality, which allows for finding the best settings in a self-contained automated way. In situations where many eigenvalue problems with similar settings have to be solved consecutively, the autotuning process of the ELPA-library can be done "on-the-fly". Practical applications from materials science which rely on so-called self-consistency iterations can profit from the autotuning. The advantages of the latest optimizations of the ELPA-library are demonstrated on some examples of scientific interest, simulated with the FHI-aims application.
Submitted 3 November, 2018;
originally announced November 2018.