-
THAPI: Tracing Heterogeneous APIs
Authors:
Solomon Bekele,
Aurelio Vivas,
Thomas Applencourt,
Servesh Muralidharan,
Bryce Allen,
Kazutomo Yoshiiinst,
Swann Perarnau,
Brice Videau
Abstract:
As we reach exascale, production High Performance Computing (HPC) systems are increasing in complexity. These systems now comprise multiple heterogeneous computing components (CPUs and GPUs) utilized through diverse, often vendor-specific programming models. As application developers and programming models experts develop higher-level, portable programming models for these systems, debugging and p…
▽ More
As we reach exascale, production High Performance Computing (HPC) systems are increasing in complexity. These systems now comprise multiple heterogeneous computing components (CPUs and GPUs) utilized through diverse, often vendor-specific programming models. As application developers and programming models experts develop higher-level, portable programming models for these systems, debugging and performance optimization requires understanding how multiple programming models stacked on top of each other interact with one another. This paper discusses THAPI (Tracing Heterogeneous APIs), a portable, programming model-centric tracing framework: by capturing comprehensive API call details across layers of the HPC software stack, THAPI enables fine-grained understanding and analysis of how applications interact with programming models and heterogeneous hardware. Leveraging state of the art tracing f ramework like the Linux Trace Toolkit Next Generation (LTTng) and tracing much more than other tracing toolkits, focused on function names and timestamps, this approach enables us to diagnose performance bottlenecks across the software stack, optimize application behavior, and debug programming model implementation issues.
△ Less
Submitted 14 April, 2025; v1 submitted 22 March, 2025;
originally announced April 2025.
-
Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach
Authors:
Xiongxiao Xu,
Solomon Abera Bekele,
Brice Videau,
Kai Shu
Abstract:
Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous hig…
▽ More
Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high performance computing (HPC) systems. Moreover, they typically rely on either purely offline training or a hybrid of offline and online training, which are impractical and lead to energy loss during data collection. Therefore, this paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios. The problem is challenging due to the inherent performance-energy trade-offs of GPUs, the exploration & exploitation dilemma across frequencies, and the lack of explicit performance counters in GPUs. To address these challenges, we formulate the online energy consumption optimization problem as a multi-armed bandit framework and develop a novel bandit based framework EnergyUCB. EnergyUCB is designed to dynamically adjust GPU core frequencies in real-time, reducing energy consumption with minimal impact on performance. Specifically, the proposed framework EnergyUCB (1) balances the performance-energy trade-off in the reward function, (2) effectively navigates the exploration & exploitation dilemma when adjusting GPU core frequencies online, and (3) leverages the ratio of GPU core utilization to uncore utilization as a real-time GPU performance metric. Experiments on a wide range of real-world HPC benchmarks demonstrate that EnergyUCB can achieve substantial energy savings. The code of EnergyUCB is available at https://github.com/XiongxiaoXu/EnergyUCB-Bandit.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
Integrating ytopt and libEnsemble to Autotune OpenMC
Authors:
Xingfu Wu,
John R. Tramm,
Jeffrey Larson,
John-Luke Navarro,
Prasanna Balaprakash,
Brice Videau,
Michael Kruse,
Paul Hovland,
Valerie Taylor,
Mary Hall
Abstract:
ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. The ytopt software adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or the wall-cl…
▽ More
ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. The ytopt software adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or the wall-clock time. libEnsemble is a Python toolkit for coordinating workflows of asynchronous and dynamic ensembles of calculations across massively parallel resources developed within the ECP PETSc/TAO project. libEnsemble helps users take advantage of massively parallel resources to solve design, decision, and inference problems and expands the class of problems that can benefit from increased parallelism. In this paper we present our methodology and framework to integrate ytopt and libEnsemble to take advantage of massively parallel resources to accelerate the autotuning process. Specifically, we focus on using the proposed framework to autotune the ECP ExaSMR application OpenMC, an open source Monte Carlo particle transport code. OpenMC has seven tunable parameters some of which have large ranges such as the number of particles in-flight, which is in the range of 100,000 to 8 million, with its default setting of 1 million. Setting the proper combination of these parameter values to achieve the best performance is extremely time-consuming. Therefore, we apply the proposed framework to autotune the MPI/OpenMP offload version of OpenMC based on a user-defined metric such as the figure of merit (FoM) (particles/s) or energy efficiency energy-delay product (EDP) on Crusher at Oak Ridge Leadership Computing Facility. The experimental results show that we achieve improvement up to 29.49\% in FoM and up to 30.44\% in EDP.
△ Less
Submitted 17 September, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Transfer-Learning-Based Autotuning Using Gaussian Copula
Authors:
Thomas Randall,
Jaehoon Koo,
Brice Videau,
Michael Kruse,
Xingfu Wu,
Paul Hovland,
Mary Hall,
Rong Ge,
Prasanna Balaprakash
Abstract:
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computatio…
▽ More
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39$\times$ speedup, a dramatic improvement over the 20.58$\times$ speedup using prior techniques.
△ Less
Submitted 9 January, 2024;
originally announced January 2024.
-
ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
Authors:
Xingfu Wu,
Prasanna Balaprakash,
Michael Kruse,
Jaehoon Koo,
Brice Videau,
Paul Hovland,
Valerie Taylor,
Brad Geltz,
Siddhartha Jana,
Mary Hall
Abstract:
As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application r…
▽ More
As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy efficient application execution, then use this framework to autotune four ECP proxy applications -- XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a Random Forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework at large scales has low overhead and achieves good scalability. Using the proposed autotuning framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% EDP improvement on up to 4,096 nodes.
△ Less
Submitted 28 March, 2023;
originally announced March 2023.