-
Performance Debugging through Microarchitectural Sensitivity and Causality Analysis
Authors:
Alban Dutilleul,
Hugo Pompougnac,
Nicolas Derumigny,
Gabriel Rodriguez,
Valentin Trophime,
Christophe Guillon,
Fabrice Rastello
Abstract:
Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to fully exploit the performance offered by hardware resources.
Current performance debugging approaches rely either on measuring resource utilization, in order to esti…
▽ More
Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to fully exploit the performance offered by hardware resources.
Current performance debugging approaches rely either on measuring resource utilization, in order to estimate which parts of a CPU induce performance limitations, or on code-based analysis deriving bottleneck information from capacity/throughput models. These approaches are limited by instrumental and methodological precision, present portability constraints across different microarchitectures, and often offer factual information about resource constraints, but not causal hints about how to solve them.
This paper presents a novel performance debugging and analysis tool that implements a resource-centric CPU model driven by dynamic binary instrumentation that is capable of detecting complex bottlenecks caused by an interplay of hardware and software factors. Bottlenecks are detected through sensitivity-based analysis, a sort of model parameterization that uses differential analysis to reveal constrained resources. It also implements a new technique we developed that we call causality analysis, that propagates constraints to pinpoint how each instruction contribute to the overall execution time.
To evaluate our analysis tool, we considered the set of high-performance computing kernels obtained by applying a wide range of transformations from the Polybench benchmark suite and measured the precision on a few Intel CPU and Arm micro-architectures. We also took one of the benchmarks (correlation) as an illustrative example to illustrate how our tool's bottleneck analysis can be used to optimize a code.
△ Less
Submitted 3 December, 2024;
originally announced December 2024.
-
Performance bottlenecks detection through microarchitectural sensitivity
Authors:
Hugo Pompougnac,
Alban Dutilleul,
Christophe Guillon,
Nicolas Derumigny,
Fabrice Rastello
Abstract:
Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to make the most of hardware resources.
We provide an in-depth overview of performance bottlenecks in recent OoO microarchitectures and describe the difficulties of det…
▽ More
Modern Out-of-Order (OoO) CPUs are complex systems with many components interleaved in non-trivial ways. Pinpointing performance bottlenecks and understanding the underlying causes of program performance issues are critical tasks to make the most of hardware resources.
We provide an in-depth overview of performance bottlenecks in recent OoO microarchitectures and describe the difficulties of detecting them. Techniques that measure resources utilization can offer a good understanding of a program's execution, but, due to the constraints inherent to Performance Monitoring Units (PMU) of CPUs, do not provide the relevant metrics for each use case.
Another approach is to rely on a performance model to simulate the CPU behavior. Such a model makes it possible to implement any new microarchitecture-related metric. Within this framework, we advocate for implementing modeled resources as parameters that can be varied at will to reveal performance bottlenecks. This allows a generalization of bottleneck analysis that we call sensitivity analysis.
We present Gus, a novel performance analysis tool that combines the advantages of sensitivity analysis and dynamic binary instrumentation within a resource-centric CPU model. We evaluate the impact of sensitivity on bottleneck analysis over a set of high-performance computing kernels.
△ Less
Submitted 24 February, 2024;
originally announced February 2024.
-
An Irredundant and Compressed Data Layout to Optimize Bandwidth Utilization of FPGA Accelerators
Authors:
Corentin Ferry,
Nicolas Derumigny,
Steven Derrien,
Sanjay Rajopadhye
Abstract:
Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data-sets. A large body of work focuses on reducing of off-chip transfers, but few authors try to improve the efficiency of transfers. This paper addresses the later issue by proposing (i) a compiler-based approach to accelerator's data layout to maximize contiguou…
▽ More
Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data-sets. A large body of work focuses on reducing of off-chip transfers, but few authors try to improve the efficiency of transfers. This paper addresses the later issue by proposing (i) a compiler-based approach to accelerator's data layout to maximize contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can decrease the I/O cycles up to $7\times$ compared to un-optimized memory accesses.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
PALMED: Throughput Characterization for Superscalar Architectures -- Extended Version
Authors:
Nicolas Derumigny,
Fabian Gruber,
Théophile Bastian,
Guillaume Iooss,
Christophe Guillon,
Louis-Noël Pouchet,
Fabrice Rastello
Abstract:
In a super-scalar architecture, the scheduler dynamically assigns micro-operations ($μ$OPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into $μ$OPs and lists for each $μ$OP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeated…
▽ More
In a super-scalar architecture, the scheduler dynamically assigns micro-operations ($μ$OPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into $μ$OPs and lists for each $μ$OP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop.
This paper introduces a dual equivalent representation: The resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters, and relies solely on runtime measurements.
We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic-blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found comparable accuracy to state-of-the art tools, achieving sub-10 % mean square error rate on this workload on Intel's Skylake microarchitecture.
△ Less
Submitted 18 January, 2022; v1 submitted 21 December, 2020;
originally announced December 2020.
-
The gem5 Simulator: Version 20.0+
Authors:
Jason Lowe-Power,
Abdul Mutaal Ahmad,
Ayaz Akram,
Mohammad Alian,
Rico Amslinger,
Matteo Andreozzi,
Adrià Armejach,
Nils Asmussen,
Brad Beckmann,
Srikant Bharadwaj,
Gabe Black,
Gedare Bloom,
Bobby R. Bruce,
Daniel Rodrigues Carvalho,
Jeronimo Castrillon,
Lizhong Chen,
Nicolas Derumigny,
Stephan Diestelhorst,
Wendy Elsasser,
Carlos Escuin,
Marjan Fariborz,
Amin Farmahini-Farahani,
Pouya Fotouhi,
Ryan Gambord,
Jayneel Gandhi
, et al. (53 additional authors not shown)
Abstract:
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 si…
▽ More
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give and overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
△ Less
Submitted 29 September, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.