-
Application Performance Modeling via Tensor Completion
Authors:
Edward Hutter,
Edgar Solomonik
Abstract:
Performance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low-rank tensor decomposition for modeling application performance. We discretize the input and configuration domains of an application using regular grids. Application execution times mapped within grid cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We further show that this decomposition enables accurate extrapolation to unobserved regions of an application's parameter space. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times. We consider alternative piecewise/grid-based models and supervised learning models for six applications and demonstrate that CP decomposition optimized using tensor completion offers higher prediction accuracy and memory efficiency for high-dimensional performance modeling.
Submitted 29 August, 2023; v1 submitted 18 October, 2022;
originally announced October 2022.
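The abstract above describes fitting a CP decomposition to a sparse set of observed execution times. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a regularized alternating least squares (ALS) optimizer, which the abstract does not name, and the function names `cp_predict` and `cp_complete` and all parameters are hypothetical.

```python
import numpy as np

def cp_predict(factors, idx):
    """Evaluate the CP model at grid cells idx (shape: num_cells x num_modes)."""
    prod = np.ones((idx.shape[0], factors[0].shape[1]))
    for n, F in enumerate(factors):
        prod *= F[idx[:, n]]          # row of mode-n factor for each cell
    return prod.sum(axis=1)           # sum over the rank dimension

def cp_complete(shape, idx, vals, rank, n_sweeps=50, reg=1e-6, seed=0):
    """Fit CP factors to observed cells only (assumed optimizer: ALS)."""
    rng = np.random.default_rng(seed)
    factors = [rng.uniform(0.5, 1.5, size=(dim, rank)) for dim in shape]
    for _ in range(n_sweeps):
        for n in range(len(shape)):
            # Per-observation product of the other modes' factor rows.
            design = np.ones((idx.shape[0], rank))
            for m, F in enumerate(factors):
                if m != n:
                    design *= F[idx[:, m]]
            # Each row of factor n solves its own small least-squares problem.
            for i in range(shape[n]):
                mask = idx[:, n] == i
                if not mask.any():
                    continue          # no observations touch this grid index
                D = design[mask]
                lhs = D.T @ D + reg * np.eye(rank)
                rhs = D.T @ vals[mask]
                factors[n][i] = np.linalg.solve(lhs, rhs)
    return factors
```

Here `idx` holds the grid-cell coordinates of the observed configurations and `vals` the averaged execution times for those cells; `cp_predict` can then be called on cells that were never measured. Only the factor matrices are stored, so memory grows with the sum of the grid dimensions times the rank rather than with their product.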
-
Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths
Authors:
Edward Hutter,
Edgar Solomonik
Abstract:
The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration's performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel's statistical profile. This strategy is effective in the presence of frequently recurring computation and communication kernels, which are characteristic of algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.
Submitted 1 March, 2021;
originally announced March 2021.
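The selective-execution idea in the abstract above (run a kernel until its timing is predictable, then substitute a prediction) can be illustrated with a small sketch. This is not Critter's API: the class `KernelProfile`, the function `run_or_predict`, the normal-quantile interval, and the thresholds are all assumptions chosen for illustration, and the "predictive model" here is simply the sample mean.

```python
import math

class KernelProfile:
    """Running timing statistics for one kernel and input signature (hypothetical)."""

    def __init__(self, rel_tol=0.05, z=1.96, min_samples=3):
        self.rel_tol = rel_tol        # allowed CI half-width relative to the mean
        self.z = z                    # assumed normal quantile (about 95%)
        self.min_samples = min_samples
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                 # sum of squared deviations (Welford)

    def record(self, t):
        # Welford's online update of mean and variance accumulator.
        self.n += 1
        delta = t - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (t - self.mean)

    def is_predictable(self):
        if self.n < self.min_samples:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        half_width = self.z * std / math.sqrt(self.n)
        return half_width <= self.rel_tol * self.mean

def run_or_predict(profile, kernel):
    """Return (time, executed). Skip the kernel once its profile is tight enough."""
    if profile.is_predictable():
        return profile.mean, False    # predicted time, kernel not executed
    t = kernel()                      # kernel is assumed to return its elapsed time
    profile.record(t)
    return t, True
```

In a distributed setting the skip decision would also have to be coordinated across processes, which this single-process sketch does not attempt.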
-
On the Hardness of Problems Involving Negator Relationships in an Artificial Hormone System
Authors:
Eric Hutter,
Mathias Pacher,
Uwe Brinkschulte
Abstract:
The Artificial Hormone System (AHS) is a self-organizing middleware to allocate tasks in a distributed system. We extended it by so-called negator hormones to enable conditional task structures. However, this extension increases the computational complexity of seemingly simple decision problems in the system: In [1] and [2], we defined the problems Negator-Path and Negator-Sat and proved their NP-completeness. In this supplementary report to these papers, we show examples of Negator-Path and Negator-Sat, introduce the novel problem Negator-Stability and explain why all of these problems involving negators are hard to solve algorithmically.
Submitted 16 June, 2020;
originally announced June 2020.
-
Communication-avoiding Cholesky-QR2 for rectangular matrices
Authors:
Edward Hutter,
Edgar Solomonik
Abstract:
Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of $\Theta(P^{1/6})$ less interprocessor communication on $P$ processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
Submitted 15 June, 2019; v1 submitted 23 October, 2017;
originally announced October 2017.
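For readers unfamiliar with CholeskyQR2, the following is a sequential NumPy sketch of the underlying numerical scheme only. It does not reproduce the 3D processor-grid parallelization the abstract describes, and the function names are chosen here for illustration. It assumes a tall matrix with full column rank that is well conditioned enough for the Cholesky factorization of its Gram matrix to succeed.

```python
import numpy as np

def cholesky_qr(A):
    """One CholeskyQR pass: A = Q R with R from the Cholesky factor of A^T A."""
    G = A.T @ A                       # Gram matrix
    L = np.linalg.cholesky(G)         # G = L L^T, so R = L^T
    # Q = A R^{-1}, computed by solving L Q^T = A^T.
    # (A triangular solver would be cheaper; a general solve keeps this short.)
    Q = np.linalg.solve(L, A.T).T
    return Q, L.T

def cholesky_qr2(A):
    """Two passes: the second pass repairs orthogonality lost in the first."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                 # product of upper-triangular factors

# Usage: factor a random tall-and-skinny matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 8))
Q, R = cholesky_qr2(A)
```

The appeal for parallel machines is that the only operations on the tall matrix are a Gram-matrix product and a triangular solve, both of which map onto matrix-multiplication-like communication patterns.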