-
Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications
Authors:
William F. Godoy,
Oscar Hernandez,
Paul R. C. Kent,
Maria Patrou,
Kazi Asifuzzaman,
Narasinga Rao Miniskar,
Pedro Valero-Lara,
Jeffrey S. Vetter,
Matthew D. Sinclair,
Jason Lowe-Power,
Bobby R. Bruce
Abstract:
We characterize the GPU energy usage of two widely adopted exascale-ready applications representing two classes of particle and mesh solvers: (i) QMCPACK, a quantum Monte Carlo package, and (ii) AMReX-Castro, an adaptive mesh astrophysical code. We analyze power, temperature, utilization, and energy traces from double-/single (mixed)-precision benchmarks on NVIDIA's A100 and H100 and AMD's MI250X GPUs using queries in NVML and rocm_smi_lib, respectively. We explore application-specific metrics to provide insights on energy vs. performance trade-offs. Our results suggest that mixed-precision energy savings range between 6-25% on QMCPACK and 45% on AMReX-Castro. Also, we found gaps in the AMD tooling used on Frontier GPUs that need to be understood, while query resolutions on NVML have little variability between 1 ms and 1 s. Overall, application-level knowledge is crucial to define energy-cost/science-benefit opportunities for the codesign of future supercomputer architectures in the post-Moore era.
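As a rough illustration of how an energy figure can be derived from a sampled power trace, the sketch below integrates power over time with the trapezoidal rule and reports a relative saving. The function names and the sample traces are hypothetical and are not taken from the paper; real traces would come from a vendor query interface such as NVML or rocm_smi_lib, and a mixed-precision run would usually also differ in duration.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate a sampled power trace (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def savings_percent(baseline_j, candidate_j):
    """Relative energy saving of the candidate run with respect to the baseline run."""
    return 100.0 * (baseline_j - candidate_j) / baseline_j


# Hypothetical traces sampled once per second (made-up values, not measurements).
t = [0.0, 1.0, 2.0, 3.0, 4.0]
double_precision_w = [300.0, 400.0, 400.0, 400.0, 300.0]
mixed_precision_w = [300.0, 320.0, 320.0, 320.0, 300.0]

e_double = energy_joules(t, double_precision_w)
e_mixed = energy_joules(t, mixed_precision_w)
saving = savings_percent(e_double, e_mixed)
```

The sampling interval matters only through `dt`, so the same routine applies to traces collected at different query resolutions.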
Submitted 16 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Adding MFMA Support to gem5
Authors:
Marco Kurzynski,
Matthew D. Sinclair
Abstract:
In this work we have enhanced gem5's GPU model support to add Matrix Core Engines (MCEs). Specifically, on the AMD MI200 and MI300 GPUs that gem5 supports, these MCEs perform Matrix Fused Multiply Add (MFMA) instructions for a variety of precisions. By adding this support, our changes enable running state-of-the-art ML workloads in gem5, as well as examining how MCE optimizations impact the behavior of future systems.
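For readers unfamiliar with the operation, a matrix fused multiply-add computes D = A x B + C on small tiles. The sketch below is a plain reference implementation of that arithmetic only; the function name `mfma` is chosen here for illustration, and the code does not model how operands are laid out in registers, which precisions are supported, or how gem5 implements the instructions.

```python
def mfma(a, b, c):
    """Return D = A x B + C for small matrices given as row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    shapes_ok = (
        all(len(row) == k for row in a)
        and all(len(row) == n for row in b)
        and len(c) == m
        and all(len(row) == n for row in c)
    )
    if not shapes_ok:
        raise ValueError("shape mismatch between A, B, and C")
    d = [row[:] for row in c]  # start from the accumulator C
    for i in range(m):
        for j in range(n):
            acc = d[i][j]
            for p in range(k):
                acc += a[i][p] * b[p][j]
            d[i][j] = acc
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
d = mfma(a, b, c)
```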
Submitted 2 February, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
Global Optimizations & Lightweight Dynamic Logic for Concurrency
Authors:
Suchita Pati,
Shaizeen Aga,
Nuwan Jayasena,
Matthew D. Sinclair
Abstract:
Modern accelerators like GPUs are increasingly executing independent operations concurrently to improve the device's compute utilization. However, effectively harnessing this concurrency on GPUs for important primitives such as general matrix multiplications (GEMMs) remains challenging. Although modern GPUs have significant hardware and software support for GEMMs, their kernel implementations and optimizations typically assume each kernel executes in isolation and can utilize all GPU resources. This approach is highly efficient when kernels execute in isolation, but causes significant resource contention and slowdowns when kernels execute concurrently. Moreover, current approaches often only statically expose and control parallelism within an application, without considering runtime information such as varying input size and concurrent applications -- often exacerbating contention. These issues limit performance benefits from concurrently executing independent operations. Accordingly, we propose GOLDYLOC, which considers the global resources across all concurrent operations to identify performant GEMM kernels, which we call globally optimized (GO)-Kernels. Moreover, GOLDYLOC introduces a lightweight dynamic logic which considers the dynamic execution environment for available parallelism and input sizes to execute performant combinations of concurrent GEMMs on the GPU. Overall, GOLDYLOC improves performance of concurrent GEMMs on a real GPU by up to 2$\times$ (18% geomean per workload) and provides up to 2.5$\times$ (43% geomean per workload) speedups over sequential execution.
Submitted 3 September, 2024;
originally announced September 2024.
-
PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
Authors:
Rutwik Jain,
Brandon Tran,
Keting Chen,
Matthew D. Sinclair,
Shivaram Venkataraman
Abstract:
Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability and this variability can lead to resource under-utilization and load imbalance. In this work we focus on how cluster schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight to address this challenge is to characterize which applications are more likely to suffer from performance variability and take that into account while placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workload applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers.
Submitted 19 September, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Authors:
Suchita Pati,
Shaizeen Aga,
Mahzabeen Islam,
Nuwan Jayasena,
Matthew D. Sinclair
Abstract:
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy.
To overcome these challenges, we propose T3 which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in $\sim$500-billion parameter models, PALM and MT-NLG.
Submitted 29 January, 2024;
originally announced January 2024.
-
Fifty Years of ISCA: A data-driven retrospective on key trends
Authors:
Gaurang Upasani,
Matthew D. Sinclair,
Adrian Sampson,
Parthasarathy Ranganathan,
David Patterson,
Shaan Shah,
Nidhi Parthasarathy,
Rutwik Jain
Abstract:
Computer Architecture, broadly, involves optimizing hardware and software for current and future processing systems. Although there are several other top venues to publish Computer Architecture research, including ASPLOS, HPCA, and MICRO, ISCA (the International Symposium on Computer Architecture) is one of the oldest, longest running, and most prestigious venues for publishing Computer Architecture research. Since 1973, except for 1975, ISCA has been organized annually. Accordingly, this year will be the 50th year of ISCA. Thus, we set out to analyze the past 50 years of ISCA to understand who and what has been driving and innovating computing systems thus far. Our analysis identifies several interesting trends that reflect how ISCA, and Computer Architecture in general, has grown and evolved in the past 50 years, including minicomputers, general-purpose uniprocessor CPUs, multiprocessor and multi-core CPUs, general-purpose GPUs, and accelerators.
Submitted 18 November, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Integrating Per-Stream Stat Tracking into Accel-Sim
Authors:
Shichen Qiao,
Xin Su,
Matthew D. Sinclair
Abstract:
Accel-Sim is a widely used computer architecture simulator that models the behavior of modern NVIDIA GPUs in great detail. However, although Accel-Sim and the underlying GPGPU-Sim model many of the features of real GPUs, thus far it has not been able to track statistics separately per stream. Instead, Accel-Sim combines statistics (e.g., cycles and cache hits/misses) across all simultaneously running streams. This can prevent users from properly identifying the behavior of specific kernels and streams and potentially lead to incorrect conclusions. Thus, in this work we extend Accel-Sim's and GPGPU-Sim's statistic tracking support to track per-stream statistics. To validate this support, we designed a series of multi-stream microbenchmarks and checked their reported per-kernel, per-stream counts.
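The difference between combined and per-stream statistics can be sketched as follows. This is a minimal illustration and not Accel-Sim's or GPGPU-Sim's actual data structures: the class `StatTracker`, the stat names, and the counts are all hypothetical.

```python
from collections import defaultdict


class StatTracker:
    """Keeps both a combined counter set and one counter set per stream (hypothetical design)."""

    def __init__(self):
        self.aggregate = defaultdict(int)
        self.per_stream = defaultdict(lambda: defaultdict(int))

    def record(self, stream_id, stat, count=1):
        # Every event updates the combined view and the view of its own stream.
        self.aggregate[stat] += count
        self.per_stream[stream_id][stat] += count


def hit_rate(stats):
    return stats["l1_hit"] / (stats["l1_hit"] + stats["l1_miss"])


tracker = StatTracker()
tracker.record(0, "l1_hit", 90)   # stream 0: cache-friendly kernel (made-up counts)
tracker.record(0, "l1_miss", 10)
tracker.record(1, "l1_hit", 20)   # stream 1: cache-unfriendly kernel (made-up counts)
tracker.record(1, "l1_miss", 80)

combined = hit_rate(tracker.aggregate)
stream0 = hit_rate(tracker.per_stream[0])
stream1 = hit_rate(tracker.per_stream[1])
```

With these made-up counts the combined hit rate sits between the two streams and describes neither of them, which is the kind of misreading that per-stream tracking is meant to avoid.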
Submitted 4 September, 2023; v1 submitted 21 April, 2023;
originally announced April 2023.
-
Computation vs. Communication Scaling for Future Transformers on Future Hardware
Authors:
Suchita Pati,
Shaizeen Aga,
Mahzabeen Islam,
Nuwan Jayasena,
Matthew D. Sinclair
Abstract:
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how compute and communication will scale relative to one another as models scale and hardware evolves. A careful study which answers this question can better guide the design of future systems which can efficiently train future large models.
Accordingly, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. First, our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, since memory capacity scales slower than compute, these trends are being stressed. Next, we quantify this edge by empirically studying how Comp-vs.-Comm scales for future models on future hardware. To avoid profiling numerous Transformer models across many setups, we extract execution regions and project costs using operator models. This allows a spectrum (hundreds) of future model/hardware scenarios to be accurately studied ($<$15% error), and reduces profiling costs by 2100$\times$. Our experiments show that communication will be a significant portion (40-75%) of runtime as models and hardware evolve. Moreover, communication which is hidden by overlapped computation in today's models often cannot be hidden in future, larger models. Overall, this work highlights the increasingly large role communication will play as models scale and discusses techniques and upcoming technologies that can help address it.
Submitted 2 May, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems
Authors:
Prasoon Sinha,
Akhil Guliani,
Rutwik Jain,
Brandon Tran,
Matthew D. Sinclair,
Shivaram Venkataraman
Abstract:
Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured hardware to support this evolving application paradigm. These systems contain hundreds to tens of thousands of accelerators, enabling peta- and exa-scale levels of compute for scientific workloads. Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU (stock keeping unit). This variation occurs due to manufacturing variability and the chip's PM. However, while modern HPC systems widely employ accelerators such as GPUs, it is unclear how much this variability affects applications. Accordingly, we seek to characterize the extent of variation due to GPU PM in modern HPC and supercomputing systems. We study a variety of applications that stress different GPU components on five large-scale computing centers with modern GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Frontera and Longhorn, and Livermore's Corona. These clusters use a variety of cooling methods and GPU vendors. In total, we collect over 18,800 hours of data across more than 90% of the GPUs in these clusters. Regardless of the application, cluster, GPU vendor, and cooling method, our results show significant variation: 8% (max 22%) average performance variation even though the GPU architecture and vendor SKU are identical within each cluster, with outliers up to 1.5X slower than the median GPU. These results highlight the difficulty in efficiently using existing GPU clusters for modern HPC and scientific workloads, and the need to embrace variability in future accelerator-based systems.
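One plausible way to express per-GPU variability is the slowdown of each GPU relative to the median GPU running the same workload. The sketch below assumes that definition; the paper's exact metric may differ, and the GPU names, runtimes, and the 1.2 threshold are invented for illustration.

```python
from statistics import median


def slowdown_vs_median(runtimes_s):
    """Map each GPU to its runtime divided by the median runtime across GPUs."""
    med = median(runtimes_s.values())
    return {gpu: t / med for gpu, t in runtimes_s.items()}


# Hypothetical runtimes (seconds) of one workload on nominally identical GPUs.
runtimes = {"gpu0": 10.0, "gpu1": 10.2, "gpu2": 9.8, "gpu3": 10.0, "gpu4": 15.0}

slowdown = slowdown_vs_median(runtimes)
# Flag GPUs more than 20% slower than the median (threshold chosen arbitrarily).
outliers = sorted(gpu for gpu, s in slowdown.items() if s > 1.2)
```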
Submitted 8 November, 2022; v1 submitted 23 August, 2022;
originally announced August 2022.
-
A Case for Fine-grain Coherence Specialization in Heterogeneous Systems
Authors:
Johnathan Alsop,
Weon Taek Na,
Matthew D. Sinclair,
Samuel Grayson,
Sarita V. Adve
Abstract:
Hardware specialization is becoming a key enabler of energy-efficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands. Traditionally, communication between accelerators has been inefficient, typically orchestrated through explicit DMA transfers between different address spaces. More recently, industry has proposed unified coherent memory which enables implicit data movement and more data reuse, but often these interfaces limit the coherence flexibility available to heterogeneous systems. This paper demonstrates the benefits of fine-grained coherence specialization for heterogeneous systems. We propose an architecture that enables low-complexity independent specialization of each individual coherence request in heterogeneous workloads by building upon a simple and flexible baseline coherence interface, Spandex. We then describe how to optimize individual memory requests to improve cache reuse and performance-critical memory latency in emerging heterogeneous workloads. Collectively, our techniques enable significant gains, reducing execution time by up to 61% or network traffic by up to 99% while adding minimal complexity to the Spandex protocol.
Submitted 23 April, 2021;
originally announced April 2021.
-
Demystifying BERT: Implications for Accelerator Design
Authors:
Suchita Pati,
Shaizeen Aga,
Nuwan Jayasena,
Matthew D. Sinclair
Abstract:
Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. Consequently, these applications are driving the requirements of future systems. Thus, we focus on BERT, one of the most popular NLP transfer learning algorithms, to identify how its algorithmic behavior can guide future accelerator design. To this end, we carefully profile BERT training and identify key algorithmic behaviors which are worthy of attention in accelerator design.
We observe that while computations which manifest as matrix multiplication dominate BERT's overall runtime, as in many convolutional neural networks, memory-intensive computations also feature prominently. We characterize these computations, which have received little attention so far. Further, we also identify heterogeneity in compute-intensive BERT computations and discuss software and possible hardware mechanisms to further optimize these computations. Finally, we discuss implications of these behaviors as networks get larger and use distributed training environments, and how techniques such as micro-batching and mixed-precision training scale. Overall, our analysis identifies holistic solutions to optimize systems for BERT-like models.
Submitted 13 April, 2021;
originally announced April 2021.
-
SeqPoint: Identifying Representative Iterations of Sequence-based Neural Networks
Authors:
Suchita Pati,
Shaizeen Aga,
Matthew D. Sinclair,
Nuwan Jayasena
Abstract:
The ubiquity of deep neural networks (DNNs) continues to rise, making them a crucial application class for hardware optimizations. However, detailed profiling and characterization of DNN training remains difficult as these applications often run for hours to days on real hardware. Prior works exploit the iterative nature of DNNs to profile a few training iterations. While such a strategy is sound for networks like convolutional neural networks (CNNs), where the nature of the computation is largely input independent, we observe in this work that this approach is sub-optimal for sequence-based neural networks (SQNNs) such as recurrent neural networks (RNNs). The amount and nature of computations in SQNNs can vary for each input, resulting in heterogeneity across iterations. Thus, arbitrarily selecting a few iterations is insufficient to accurately summarize the behavior of the entire training run. To tackle this challenge, we carefully study the factors that impact SQNN training iterations and identify input sequence length as the key determining factor for variations across iterations. We then use this observation to characterize all iterations of an SQNN training run (requiring no profiling or simulation of the application) and select representative iterations, which we term SeqPoints. We analyze two state-of-the-art SQNNs, DeepSpeech2 and Google's Neural Machine Translation (GNMT), and show that SeqPoints can represent their entire training runs accurately, resulting in geomean errors of only 0.11% and 0.53%, respectively, when projecting overall runtime and 0.13% and 1.50% when projecting speedups due to architectural changes. This high accuracy is achieved while reducing the time needed for profiling by 345x and 214x for the two networks compared to full training runs. As a result, SeqPoint can enable analysis of SQNN training runs in mere minutes instead of hours or days.
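The idea of choosing representative iterations from input sequence lengths alone can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's selection algorithm: iterations are grouped into fixed-width sequence-length bins, the iteration closest to each bin's mean length is taken as its representative, and total runtime is projected by weighting each representative's profiled runtime by its bin size. All names and numbers are hypothetical.

```python
from collections import defaultdict


def select_representatives(seq_lens, bin_width):
    """Return (iteration_index, weight) pairs, one per sequence-length bin."""
    bins = defaultdict(list)
    for it, length in enumerate(seq_lens):
        bins[length // bin_width].append(it)
    reps = []
    for key in sorted(bins):
        members = bins[key]
        mean_len = sum(seq_lens[i] for i in members) / len(members)
        # Pick the iteration whose sequence length is closest to the bin mean.
        rep = min(members, key=lambda i: abs(seq_lens[i] - mean_len))
        reps.append((rep, len(members)))
    return reps


def project_total_runtime(reps, profiled_runtime_s):
    """Weight each representative's profiled runtime by the size of its bin."""
    return sum(weight * profiled_runtime_s[it] for it, weight in reps)


# Hypothetical per-iteration input sequence lengths for a short training run.
seq_lens = [12, 15, 48, 50, 52, 95]
reps = select_representatives(seq_lens, bin_width=32)

# Only the representatives need profiling (runtimes below are made up).
profiled = {0: 0.10, 3: 0.30, 5: 0.60}
total = project_total_runtime(reps, profiled)
```

Selecting representatives needs only the sequence lengths, so no profiling or simulation is required until the chosen iterations are run.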
△ Less
Submitted 20 July, 2020;
originally announced July 2020.
-
The gem5 Simulator: Version 20.0+
Authors:
Jason Lowe-Power,
Abdul Mutaal Ahmad,
Ayaz Akram,
Mohammad Alian,
Rico Amslinger,
Matteo Andreozzi,
Adrià Armejach,
Nils Asmussen,
Brad Beckmann,
Srikant Bharadwaj,
Gabe Black,
Gedare Bloom,
Bobby R. Bruce,
Daniel Rodrigues Carvalho,
Jeronimo Castrillon,
Lizhong Chen,
Nicolas Derumigny,
Stephan Diestelhorst,
Wendy Elsasser,
Carlos Escuin,
Marjan Fariborz,
Amin Farmahini-Farahani,
Pouya Fotouhi,
Ryan Gambord,
Jayneel Gandhi
, et al. (53 additional authors not shown)
Abstract:
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
Submitted 29 September, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Specializing Coherence, Consistency, and Push/Pull for GPU Graph Analytics
Authors:
Giordano Salvador,
Wesley H. Darvin,
Muhammad Huzaifa,
Johnathan Alsop,
Matthew D. Sinclair,
Sarita V. Adve
Abstract:
This work provides the first study to explore the interaction of update propagation with and without fine-grained synchronization (push vs. pull), emerging coherence protocols (GPU vs. DeNovo coherence), and software-centric consistency models (DRF0, DRF1, and DRFrlx) for graph workloads on emerging integrated GPU-CPU systems with native unified shared memory. We study 6 graph applications with 6 graph inputs for a total of 36 workloads running on 12 system (hardware+software) configurations reflecting the above design space of update propagation, coherence, and memory consistency. We make three key contributions. First, we show that there is no single best system configuration for all workloads, motivating systems with flexible coherence and consistency support. Second, we develop a model to accurately predict the best system configuration -- this model can be used by software designers to decide on push vs. pull and the consistency model and by flexible hardware to invoke the appropriate coherence and consistency configuration for the given workload. Third, we show that the design dimensions explored here are inter-dependent, reinforcing the need for software-hardware co-design in the above design dimensions. For example, software designers deciding on push vs. pull must consider the consistency model supported by hardware -- in some cases, push may be better if hardware supports DRFrlx while pull may be better if hardware does not support DRFrlx.
Submitted 25 February, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Optimizing GPU Cache Policies for MI Workloads
Authors:
Johnathan Alsop,
Matthew D. Sinclair,
Srikant Bharadwaj,
Alexandru Dutu,
Anthony Gutierrez,
Onur Kayiran,
Michael LeBeane,
Sooraj Puthoor,
Xianwei Zhang,
Tsung Tai Yeh,
Bradford M. Beckmann
Abstract:
In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly limit performance, determining the best GPU caching policy to use for a diverse range of MI workloads represents one important challenge. To study this, we evaluate 17 MI applications and characterize their behaviors using a range of GPU caching strategies. In our evaluations, we find that the choice of caching policy in GPU caches involves multiple performance trade-offs and interactions, and there is no one-size-fits-all GPU caching policy for MI workloads. Based on detailed simulation results, we motivate and evaluate a set of cache optimizations that consistently match the performance of the best static GPU caching policies.
Submitted 30 September, 2019;
originally announced October 2019.
-
Analyzing Machine Learning Workloads Using a Detailed GPU Simulator
Authors:
Jonathan Lew,
Deval Shah,
Suchita Pati,
Shaylin Cattell,
Mengchi Zhang,
Amruth Sandhupatla,
Christopher Ng,
Negar Goli,
Matthew D. Sinclair,
Timothy G. Rogers,
Tor Aamodt
Abstract:
Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch. This paper describes changes we made to the GPGPU-Sim simulator to enable it to run PyTorch by running PTX kernels included in NVIDIA's cuDNN library. We use the resulting modified simulator, which has been made available publicly with this paper, to study some simple deep learning workloads. With our changes to GPGPU-Sim's functional simulation model, we find that GPGPU-Sim's performance model running a cuDNN-enabled implementation of LeNet for MNIST reports results within 30% of real hardware. Using GPGPU-Sim's AerialVision performance analysis tool we observe that cuDNN API calls contain many varying phases and appear to include potentially inefficient microarchitecture behaviour such as DRAM partition bank camping, at least when executed on GPGPU-Sim's current performance model.
Submitted 26 January, 2019; v1 submitted 18 November, 2018;
originally announced November 2018.