Showing 1–50 of 74 results for author: Wellein, G

Searching in archive cs.
  1. arXiv:2412.08792  [pdf, other]

    cs.DC cs.PF

    Analytic Roofline Modeling and Energy Analysis of LULESH Proxy Application on Multi-Core Clusters

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: We present a thorough performance and energy consumption analysis of the LULESH proxy application in its OpenMP and MPI variants on two different clusters based on Intel Ice Lake (ICL) and Sapphire Rapids (SPR) CPUs. We first study the strong scaling and power consumption characteristics of the six hot spot functions in the code on the node level, with a special focus on memory bandwidth utilizati…

    Submitted 11 December, 2024; originally announced December 2024.

    Comments: 10 pages, 11 figures, 4 tables

  2. arXiv:2409.08108  [pdf, other]

    cs.PF cs.DC

    Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa

    Authors: Jan Laukemann, Georg Hager, Gerhard Wellein

    Abstract: With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Anal…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 5 pages, 4 figures

  3. arXiv:2405.12525  [pdf, other]

    cs.DC cs.PF

    Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

    Authors: Dane C. Lacey, Christie L. Alappat, Florian Lange, Georg Hager, Holger Fehske, Gerhard Wellein

    Abstract: Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) p…

    Submitted 22 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

    Comments: 15 pages, 12 figures, 5 tables; added affiliation & extended acknowledgment

  4. Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs

    Authors: Herbert Owen, Dominik Ernst, Thomas Gruber, Oriol Lemkuhl, Guillaume Houzeaux, Lucas Gasparino, Gerhard Wellein

    Abstract: This paper addresses the challenge of providing portable and highly efficient code structures for CPU and GPU architectures. We choose the assembly of the right-hand term in the incompressible flow module of the High-Performance Computational Mechanics code Alya, which is one of the two CFD codes in the Unified European Benchmark Suite. Starting from an efficient CPU-code and a related OpenACC-por…

    Submitted 22 January, 2024; originally announced March 2024.

  5. CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

    Authors: Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein

    Abstract: In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measur…

    Submitted 17 May, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 19 pages including artifact appendix; 11 figures, 1 table; numerous corrections, esp. in Table 1

  6. arXiv:2310.05701  [pdf, other]

    cs.DC physics.comp-ph

    Physical Oscillator Model for Supercomputing

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: A parallel program together with the parallel hardware it is running on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures

  7. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only codes' variants. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks…

    Submitted 14 September, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: 9 pages, 6 figures; corrected links to system docs

  8. arXiv:2309.02228  [pdf, other]

    math.NA cs.DC

    Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

    Authors: Christie Alappat, Jonas Thies, Georg Hager, Holger Fehske, Gerhard Wellein

    Abstract: Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov me…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: 25 pages, 11 figures, 3 tables

  9. arXiv:2302.14660  [pdf, other]

    physics.chem-ph cs.PF physics.comp-ph

    MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages

    Authors: Rafael Ravedutti Lucio Machado, Jan Eitzinger, Jan Laukemann, Georg Hager, Harald Köstler, Gerhard Wellein

    Abstract: Molecular dynamics (MD) simulations provide considerable benefits for the investigation and experimentation of systems at atomic level. Their usage is widespread into several research fields, but their system size and timescale are also crucially limited by the computing power they can make use of. Performance engineering of MD kernels is therefore important to understand their bottlenecks and poi…

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: 17 pages, 10 figures, 5 tables. arXiv admin note: text overlap with arXiv:2207.13094

  10. Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

    Authors: Ayesha Afzal, Georg Hager, Stefano Markidis, Gerhard Wellein

    Abstract: Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate inject…

    Submitted 24 February, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: 18 pages, 14 figures, 7 tables. Corrected Fig. 4 layout

  11. arXiv:2207.13094  [pdf, other]

    physics.comp-ph cs.PF

    MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

    Authors: Rafael Ravedutti Lucio Machado, Jan Eitzinger, Harald Köstler, Gerhard Wellein

    Abstract: Proxy-apps, or mini-apps, are simple self-contained benchmark codes with performance-relevant kernels extracted from real applications. Initially used to facilitate software-hardware co-design, they are a crucial ingredient for serious performance engineering, especially when dealing with large-scale production codes. MD-Bench is a new proxy-app in the area of classical short-range molecular dynam…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: 12 Pages, 2 figures, submitted to PPAM22

  12. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein, Stefano Markidis

    Abstract: This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time p…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 12 pages, 9 figures, 1 table

  13. The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: 13 pages, 7 figures, 6 tables

  14. arXiv:2205.01598  [pdf, other]

    math.NA cs.DC cs.PF

    Level-based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication

    Authors: Christie L. Alappat, Georg Hager, Olaf Schenk, Gerhard Wellein

    Abstract: The multiplication of a sparse matrix with a dense vector (SpMV) is a key component in many numerical schemes and its performance is known to be severely limited by main memory access. Several numerical schemes require the multiplication of a sparse matrix polynomial with a dense vector, which is typically implemented as a sequence of SpMVs. This results in low performance and ignores the potentia…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: 18 pages, 19 figures, 3 tables

  15. Analytical Performance Estimation during Code Generation on Modern GPUs

    Authors: Dominik Ernst, Markus Holzer, Georg Hager, Matthias Knorr, Gerhard Wellein

    Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box ma…

    Submitted 29 April, 2022; originally announced April 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2107.01143

  16. Opening the Black Box: Performance Estimation during Code Generation for GPUs

    Authors: Dominik Ernst, Georg Hager, Markus Holzer, Matthias Knorr, Gerhard Wellein

    Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scena…

    Submitted 2 July, 2021; originally announced July 2021.

    ACM Class: C.4

  17. Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlyin…

    Submitted 4 March, 2021; originally announced March 2021.

    Comments: 19 pages, 10 figures, 2 tables

  18. arXiv:2103.03013  [pdf, other]

    cs.PF cs.DC hep-lat

    ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

    Authors: Christie Alappat, Nils Meyer, Jan Laukemann, Thomas Gruber, Georg Hager, Gerhard Wellein, Tilo Wettig

    Abstract: The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in…

    Submitted 30 July, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: 32 pages, 25 figures, 6 tables

  19. arXiv:2011.00243  [pdf, other]

    cs.DC cs.PF

    An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop can not be upheld. Motivated b…

    Submitted 31 October, 2020; originally announced November 2020.

    Comments: 10 pages, 9 figures

  20. Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

    Authors: Christie L. Alappat, Jan Laukemann, Thomas Gruber, Georg Hager, Gerhard Wellein, Nils Meyer, Tilo Wettig

    Abstract: The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performanc…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: 6 pages, 5 figures, 3 tables

  21. Multiway $p$-spectral graph cuts on Grassmann manifolds

    Authors: Dimosthenis Pasadakis, Christie Louis Alappat, Olaf Schenk, Gerhard Wellein

    Abstract: Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the $p$-norm, for $p \in (1, 2]$. The problem of computing multiple eigenvectors of the graph $p$-Laplacian, a nonlinear generalization of the standard…

    Submitted 26 November, 2021; v1 submitted 30 August, 2020; originally announced August 2020.

    MSC Class: 68R10 (Primary); 90C27 (Secondary) ACM Class: G.2.1; G.2.2

    Journal ref: Mach Learn (2021)

  22. Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

    Authors: Christie L. Alappat, Johannes Hofmann, Georg Hager, Holger Fehske, Alan R. Bishop, Gerhard Wellein

    Abstract: Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are relevant for performance are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that fathom a particular aspect of the har…

    Submitted 12 February, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

    Comments: 19 pages, 9 figures, 3 tables. Corrected affiliations and acknowledgments

  23. Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time does often not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noi…

    Submitted 7 February, 2020; originally announced February 2020.

    Comments: 18 pages, 8 figures

  24. Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

    Authors: Jan Laukemann, Julian Hammer, Georg Hager, Gerhard Wellein

    Abstract: Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and…

    Submitted 21 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: 6 pages, 3 figures

  25. arXiv:1907.06487  [pdf, other]

    cs.DC cs.PF

    A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication

    Authors: Christie L. Alappat, Georg Hager, Olaf Schenk, Jonas Thies, Achim Basermann, Alan R. Bishop, Holger Fehske, Gerhard Wellein

    Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data…

    Submitted 15 July, 2019; originally announced July 2019.

    Comments: 40 pages, 23 figures

  26. arXiv:1907.00048  [pdf, ps, other]

    cs.DC cs.AR cs.PF

    Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

    Authors: Johannes Hofmann, Christie L. Alappat, Georg Hager, Dietmar Fey, Gerhard Wellein

    Abstract: We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a CPU. We introduc…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: 12 pages, 7 figures

  27. Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

    Authors: Julian Hornich, Julian Hammer, Georg Hager, Thomas Gruber, Gerhard Wellein

    Abstract: Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a…

    Submitted 2 July, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

  28. Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

    Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein

    Abstract: Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such…

    Submitted 28 August, 2019; v1 submitted 25 May, 2019; originally announced May 2019.

    Comments: 10 pages, 9 figures; title changed

  29. Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs

    Authors: Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein

    Abstract: General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in t…

    Submitted 18 February, 2020; v1 submitted 8 May, 2019; originally announced May 2019.

    Comments: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for journal submission

  30. Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

    Authors: Francesco Cremonesi, Georg Hager, Gerhard Wellein, Felix Schürmann

    Abstract: Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel computer performance has been supporting these developments, and at the same time maintainers of neuroscientific simulation code have strived to optimally and e…

    Submitted 16 January, 2019; originally announced January 2019.

    Comments: 18 pages, 6 figures, 15 tables

  31. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

    Authors: Jan Laukemann, Julian Hammer, Johannes Hofmann, Georg Hager, Gerhard Wellein

    Abstract: An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understandin…

    Submitted 10 October, 2018; v1 submitted 4 September, 2018; originally announced September 2018.

    Comments: 11 pages, 4 figures, 7 tables

  32. arXiv:1803.02156  [pdf, ps, other]

    cs.MS cs.PF physics.comp-ph

    Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs

    Authors: Moritz Kreutzer, Georg Hager, Dominik Ernst, Holger Fehske, Alan R. Bishop, Gerhard Wellein

    Abstract: Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address sp…

    Submitted 6 March, 2018; originally announced March 2018.

    Comments: 18 pages, 8 figures

  33. Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis

    Authors: Markus Wittmann, Viktor Haag, Thomas Zeiser, Harald Köstler, Gerhard Wellein

    Abstract: Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, but also on an efficient implementation on modern hardware. Hence, performance optimization has a long history in the lattice Boltzmann community. A va…

    Submitted 30 November, 2017; originally announced November 2017.

    Comments: preprint, submitted to Computer & Fluids Special Issue DSFD2017

    Journal ref: Computers & Fluids, 2018

  34. Validation of hardware events for successful performance pattern identification in High Performance Computing

    Authors: Thomas Röhl, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf_event which provide HPM access with some additional features, many higher level tools combine event counts with results retrieved from other sources like function call traces to derive (semi-)automatic performance advice. However, a…

    Submitted 11 October, 2017; originally announced October 2017.

    Journal ref: Tools for High Performance Computing 2015

  35. arXiv:1708.02030  [pdf, ps, other]

    cs.DC

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Authors: Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

    Abstract: In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but…

    Submitted 7 August, 2017; originally announced August 2017.

  36. LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

    Authors: Thomas Röhl, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: System monitoring is an established tool to measure the utilization and health of HPC systems. Usually system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathologic…

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI, September 5, 2017

  37. arXiv:1702.07554  [pdf, ps, other]

    cs.PF

    An analysis of core- and chip-level architectural features in four generations of Intel server processors

    Authors: Johannes Hofmann, Georg Hager, Gerhard Wellein, Dietmar Fey

    Abstract: This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock s…

    Submitted 24 February, 2017; originally announced February 2017.

  38. Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

    Authors: Julian Hammer, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple mach…

    Submitted 13 January, 2017; originally announced February 2017.

    Comments: 22 pages, 5 figures

  39. arXiv:1609.01507  [pdf, other]

    cs.DC astro-ph.IM physics.comp-ph physics.flu-dyn

    Extreme Scale-out SuperMUC Phase 2 - lessons learned

    Authors: Nicolay Hammer, Ferdinand Jamitzky, Helmut Satzger, Momme Allalen, Alexander Block, Anupam Karmakar, Matthias Brehm, Reinhold Bader, Luigi Iapichino, Antonio Ragagnin, Vasilios Karakasis, Dieter Kranzlmüller, Arndt Bode, Herbert Huber, Martin Kühn, Rui Machado, Daniel Grünewald, Philipp V. F. Edelmann, Friedrich K. Röpke, Markus Wittmann, Thomas Zeiser, Gerhard Wellein, Gerald Mathias, Magnus Schwörer, Konstantin Lorenzen, et al. (14 additional authors not shown)

    Abstract: In spring 2015, the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ), installed their new Peta-Scale System SuperMUC Phase2. Selected users were invited for a 28 day extreme scale-out block operation during which they were allowed to use the full system for their applications. The following projects participated in the extreme scale-out workshop: BQCD (Quantum Physics), SeisSol (Geophysi…

    Submitted 6 September, 2016; originally announced September 2016.

    Comments: 10 pages, 5 figures, presented at ParCo2015 - Advances in Parallel Computing, held in Edinburgh, September 2015. The final publication is available at IOS Press through http://dx.doi.org/10.3233/978-1-61499-621-7-827

    Journal ref: Advances in Parallel Computing, vol. 27: Parallel Computing: On the Road to Exascale, eds. G.R. Joubert et al., p. 827, 2016

  40. arXiv:1604.01890  [pdf, ps, other]

    cs.PF cs.DC

    Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors

    Authors: Johannes Hofmann, Dietmar Fey, Michael Riedmann, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent multi- and manycore processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bo…

    Submitted 7 April, 2016; originally announced April 2016.

    Comments: 15 pages, 10 figures

    Journal ref: Concurrency Computat.: Pract. Exper., 29: e3921 (2016)

  41. arXiv:1511.03639  [pdf, ps, other]

    cs.DC cs.AR

    Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks

    Authors: Johannes Hofmann, Dietmar Fey, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined is the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements as new and improved execution units, as well as improvements throughout the memory hierarchy. The Execution-Cache-Memory diagnostic performance model is used togethe…

    Submitted 13 November, 2015; v1 submitted 11 November, 2015; originally announced November 2015.

    Comments: arXiv admin note: substantial text overlap with arXiv:1509.03118

  42. Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Authors: Julian Hammer, Georg Hager, Jan Eitzinger, Gerhard Wellein

    Abstract: Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced develop…

    Submitted 5 November, 2015; v1 submitted 12 September, 2015; originally announced September 2015.

    Comments: 11 pages, 4 figures, 8 listings

  43. GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Authors: Moritz Kreutzer, Jonas Thies, Melven Röhrig-Zöllner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, Gerhard Wellein

    Abstract: While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such…

    Submitted 15 February, 2016; v1 submitted 29 July, 2015; originally announced July 2015.

    Comments: 32 pages, 11 figures

  44. arXiv:1506.03997  [pdf, other]

    cs.PF

    Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero

    Authors: Markus Wittmann, Thomas Zeiser, Georg Hager, Gerhard Wellein

    Abstract: Simple floating point operations like addition or multiplication on normalized floating point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underflow occurs and the value can no longer be represented as a normalized floating-point value. Here the costs are about two orders of magnitude higher.

    Submitted 12 June, 2015; originally announced June 2015.

  45. arXiv:1505.04628  [pdf, ps, other]

    cs.DC

    Building a fault tolerant application using the GASPI communication layer

    Authors: Faisal Shahzad, Moritz Kreutzer, Thomas Zeiser, Rui Machado, Andreas Pieper, Georg Hager, Gerhard Wellein

    Abstract: It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet…

    Submitted 18 May, 2015; originally announced May 2015.

  46. Performance analysis of the Kahan-enhanced scalar product on current multicore processors

    Authors: Johannes Hofmann, Dietmar Fey, Jan Eitzinger, Georg Hager, Gerhard Wellein

    Abstract: We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent Intel processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bottlenecks for…

    Submitted 11 May, 2015; originally announced May 2015.

    Comments: 10 pages, 4 figures

  47. arXiv:1410.5242  [pdf, ps, other]

    cs.CE cond-mat.mes-hall cs.DC cs.PF physics.comp-ph

    Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

    Authors: Moritz Kreutzer, Georg Hager, Gerhard Wellein, Andreas Pieper, Andreas Alvermann, Holger Fehske

    Abstract: The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse…

    Submitted 29 July, 2015; v1 submitted 20 October, 2014; originally announced October 2014.

    Comments: 10 pages, 12 figures

    Journal ref: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 417-426

  48. Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model

    Authors: Holger Stengel, Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of expected speedup. Understanding the performance properties and bottlenecks by performance modeling enables a clear view on promising optimization opportunities. In t…

    Submitted 17 January, 2015; v1 submitted 18 October, 2014; originally announced October 2014.

    Comments: 10 pages, 8 figures. Added Roofline comparison and other minor improvements

  49. Multicore-optimized wavefront diamond blocking for optimizing stencil updates

    Authors: Tareq Malas, Georg Hager, Hatem Ltaief, Holger Stengel, Gerhard Wellein, David Keyes

    Abstract: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, espec…

    Submitted 12 October, 2014; originally announced October 2014.

  50. arXiv:1410.0412  [pdf, other]

    cs.DC

    Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices

    Authors: M. Wittmann, T. Zeiser, G. Hager, G. Wellein

    Abstract: Computational fluid dynamics (CFD) requires a vast amount of compute cycles on contemporary large-scale parallel computers. Hence, performance optimization is a pivotal activity in this field of computational science. Not only does it reduce the time to solution, but it also allows to minimize the energy consumption. In this work we study performance optimizations for an MPI-parallel lattice Boltz…

    Submitted 23 December, 2015; v1 submitted 1 October, 2014; originally announced October 2014.

    Comments: Updated and extended version. Submitted to ISC 2015