-
RAPTOR: Practical Numerical Profiling of Scientific Applications
Authors:
Faveo Hoerold,
Ivan R. Ivanov,
Akash Dhruv,
William S. Moses,
Anshu Dubey,
Mohamed Wahib,
Jens Domke
Abstract:
The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by Artificial Intelligence, vendors introduced novel low-precision units for vector and tensor operations, and FP64 capabilities stagnate or are reduced. This is forcing scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice. We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations using low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach, with a focus on ease of use, to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.
Submitted 7 July, 2025;
originally announced July 2025.
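The RAPTOR abstract above notes that a naive switch from FP64 to FP16 will not suffice. As a minimal sketch of why, the following emulates FP16 arithmetic in plain Python by rounding after every operation. This is not RAPTOR's LLVM-based mechanism; the helper names `round_to_fp16`, `sum_fp64`, and `sum_fp16_emulated` are hypothetical and exist only for this illustration.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round an FP64 value to the nearest IEEE 754 binary16 (FP16) value."""
    # The "e" format code packs a float as binary16; unpacking widens it back.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_fp64(values):
    """Reference accumulation in Python's native FP64."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_fp16_emulated(values):
    """Accumulate while rounding every input and partial sum to FP16."""
    total = 0.0
    for v in values:
        total = round_to_fp16(total + round_to_fp16(v))
    return total


ones = [1.0] * 3000
print(sum_fp64(ones))           # 3000.0
print(sum_fp16_emulated(ones))  # 2048.0
```

The emulated FP16 sum stalls at 2048 because neighbouring FP16 values are 2 apart at that magnitude, so adding 1.0 rounds back to the same number. Profiling a code region for this kind of effect is the sort of question a numerical profiler is meant to answer.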
-
Distributed Cross-Channel Hierarchical Aggregation for Foundation Models
Authors:
Aristeidis Tsaris,
Isaac Lyngaas,
John Lagregren,
Mohamed Wahib,
Larry York,
Prasanna Balaprakash,
Dan Lu,
Feiyi Wang,
Xiao Wang
Abstract:
Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources such as varying physical groundings or data acquisition systems and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-intensive, a challenge not fully addressed by current distributed methods. In this work, we introduce the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets with a large number of channels across image modalities. Our method is compatible with any model-parallel strategy and any type of vision transformer architecture, significantly improving computational efficiency. We evaluated D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated with tensor parallelism and model sharding, our approach achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 AMD GPUs on the Frontier Supercomputer.
Submitted 26 June, 2025;
originally announced June 2025.
-
Balanced and Elastic End-to-end Training of Dynamic LLMs
Authors:
Mohamed Wahib,
Muhammed Abdullah Soyturk,
Didem Unat
Abstract:
To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when using pipeline parallelism in training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs). DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.
Submitted 20 May, 2025;
originally announced May 2025.
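The DynMo abstract above describes rebalancing pipeline stages when per-layer workloads change during training. The sketch below illustrates the underlying problem only: it splits a list of per-layer costs into contiguous stages so that the most heavily loaded stage is as light as possible. It is not DynMo's algorithm; the function `partition_layers` and the integer cost values are assumptions made for this example.

```python
def partition_layers(costs, num_stages):
    """Split integer layer costs into at most num_stages contiguous stages.

    Returns (bottleneck, stages), where bottleneck is the smallest achievable
    maximum stage cost and stages is one partition that attains it.
    """

    def stages_needed(limit):
        # Greedily pack layers into a stage until the limit would be exceeded.
        count, current = 1, 0
        for c in costs:
            if current + c > limit:
                count += 1
                current = c
            else:
                current += c
        return count

    # Binary search for the smallest per-stage limit that fits in num_stages.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the actual stage assignment for the optimal limit.
    stages, current = [[]], 0
    for c in costs:
        if stages[-1] and current + c > lo:
            stages.append([])
            current = 0
        stages[-1].append(c)
        current += c
    return lo, stages


# Uniform layers: an even split is optimal.
print(partition_layers([4, 4, 4, 4], 2))  # (8, [[4, 4], [4, 4]])
# After the first two layers become cheaper (e.g. pruned), the even split
# would leave a bottleneck of 8; rebalancing brings it down to 6.
print(partition_layers([1, 1, 4, 4], 2))  # (6, [[1, 1, 4], [4]])
```

The second call shows why a static partition becomes unbalanced once layer costs drift: keeping the original two-and-two split would cost 8 per step instead of 6.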
-
Paradigm Shift in Infrastructure Inspection Technology: Leveraging High-performance Imaging and Advanced AI Analytics to Inspect Road Infrastructure
Authors:
Du Wu,
Enzhi Zhang,
Isaac Lyngaas,
Xiao Wang,
Amir Ziabari,
Tao Luo,
Peng Chen,
Kento Sato,
Fumiyoshi Shoji,
Takaki Hatsui,
Kentaro Uesugi,
Akira Seo,
Yasuhito Sakai,
Toshio Endo,
Tetsuya Ishikawa,
Satoshi Matsuoka,
Mohamed Wahib
Abstract:
Effective road infrastructure management is crucial for modern society. Traditional manual inspection techniques remain constrained by cost, efficiency, and scalability, while camera and laser imaging methods fail to capture subsurface defects critical for long-term structural integrity. This paper introduces ROVAI, an end-to-end framework that integrates high-resolution X-ray computed tomography imaging and advanced AI-driven analytics, aiming to transform road infrastructure inspection technologies. By leveraging the computational power of world-leading supercomputers, Fugaku and Frontier, and a SoTA synchrotron facility (SPring-8), ROVAI enables scalable and high-throughput processing of massive 3D tomographic datasets. Our approach overcomes key challenges, such as the high memory requirements of vision models, the lack of labeled training data, and storage I/O bottlenecks. This seamless integration of imaging and AI analytics facilitates automated defect detection, material composition analysis, and lifespan prediction. Experimental results demonstrate the effectiveness of ROVAI in real-world scenarios, setting a new standard for intelligent, data-driven infrastructure management.
Submitted 20 May, 2025;
originally announced May 2025.
-
Think Before You Attribute: Improving the Performance of LLMs Attribution Systems
Authors:
João Eduardo Batista,
Emil Vatai,
Mohamed Wahib
Abstract:
Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are non-negotiable. To be reliable, attribution systems need high accuracy and must retrieve short passages, i.e., attribute to a sentence within a document rather than a whole document. We propose a sentence-level pre-attribution step for Retrieval-Augmented Generation (RAG) systems that classifies sentences into three categories: not attributable, attributable to a single quote, and attributable to multiple quotes. By separating sentences before attribution, a proper attribution method can be selected for the type of sentence, or the attribution can be skipped altogether. Our results indicate that classifiers are well-suited for this task. In this work, we propose a pre-attribution step to reduce the computational complexity of attribution, provide a clean version of the HAGRID dataset, and provide an end-to-end attribution system that works out of the box.
Submitted 18 May, 2025;
originally announced May 2025.
-
ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling
Authors:
Xiao Wang,
Jong-Youl Choi,
Takuya Kurihaya,
Isaac Lyngaas,
Hong-Jun Yoon,
Ming Fan,
Nasik Muhammad Nafi,
Aristeidis Tsaris,
Ashwin M. Aji,
Maliha Hossain,
Mohamed Wahib,
Dali Wang,
Peter Thornton,
Prasanna Balaprakash,
Moetasim Ashfaq,
Dan Lu
Abstract:
Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92-98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with R^2 scores in the range of 0.98 to 0.99 against observation data.
Submitted 7 May, 2025;
originally announced May 2025.
-
A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows
Authors:
Jens Domke,
Mohamed Wahib,
Anshu Dubey,
Tal Ben-Nun,
Erik W. Draeger
Abstract:
Current trends point to a future where large-scale scientific applications are tightly-coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can efficiently work together and adapt to novel hardware and vendor libraries without starting from scratch every few years. The current ecosystem and sparsely-connected community are not sufficient to tackle these challenges, and we require a breakthrough catalyst for science similar to what PyTorch enabled for AI.
Submitted 5 May, 2025;
originally announced May 2025.
-
Can Tensor Cores Benefit Memory-Bound Kernels? (No!)
Authors:
Lingqi Zhang,
Jiajun Huang,
Sheng Di,
Satoshi Matsuoka,
Mohamed Wahib
Abstract:
Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success, researchers have attempted to extend tensor core capabilities beyond dense matrix computations to other computational patterns, including memory-bound kernels. Recent studies have reported that tensor cores can outperform traditional CUDA cores even on memory-bound kernels, where the primary performance bottleneck is not computation. In this research, we challenge these findings through both theoretical and empirical analysis. Our theoretical analysis reveals that tensor cores can achieve a maximum speedup of only 1.33x over CUDA cores for memory-bound kernels in double precision (for V100, A100, and H100 GPUs). We validate this theoretical limit through empirical analysis of three representative memory-bound kernels: STREAM Scale, SpMV, and stencil. We demonstrate that optimizing memory-bound kernels using tensor cores does not yield sound performance improvements over CUDA cores.
Submitted 27 February, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers
Authors:
Chen Zhuang,
Lingqi Zhang,
Du Wu,
Peng Chen,
Jiajun Huang,
Xin Liu,
Rio Yokota,
Nikoli Dryden,
Toshio Endo,
Satoshi Matsuoka,
Mohamed Wahib
Abstract:
Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce \method{}, an efficient and scalable distributed GCN training framework tailored for CPU-powered supercomputers. Our contributions are threefold: (1) we develop general and efficient aggregation operators designed for irregular memory access, (2) we propose a hierarchical aggregation scheme that reduces communication costs without altering the graph structure, and (3) we present a communication-aware quantization scheme to enhance performance. Experimental results demonstrate that \method{} achieves a speedup of up to 6$\times$ compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs on the largest publicly available datasets, without sacrificing model convergence and accuracy. Moreover, due to the effective strong scaling of \method{}, we outperform SoTA GPU-based and CPU-based distributed full-batch GCN training frameworks, in absolute performance, for large-scale graphs.
Submitted 26 May, 2025; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness
Authors:
Emil Vatai,
Aleksandr Drozd,
Ivan R. Ivanov,
Joao E. Batista,
Yinghao Ren,
Mohamed Wahib
Abstract:
Frameworks and domain-specific languages for auto-generating code have traditionally depended on human experts to implement rigorous methods ensuring the legality of code transformations. Recently, machine learning (ML) has gained traction for generating code optimized for specific hardware targets. However, ML approaches, particularly black-box neural networks, offer no guarantees on the correctness or legality of the transformations they produce. To address this gap, we introduce Tadashi, an end-to-end system that leverages the polyhedral model to support researchers in curating datasets critical for ML-based code generation. Tadashi provides an end-to-end system capable of applying, verifying, and evaluating candidate transformations on polyhedral schedules with both reliability and practicality. We formally prove that Tadashi guarantees the legality of generated transformations, demonstrate its low runtime overhead, and showcase its broad applicability. Tadashi is available at https://github.com/vatai/tadashi/.
Submitted 2 June, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism
Authors:
Yu Xue,
Chenchen Zhu,
MengChu Zhou,
Mohamed Wahib,
Moncef Gabbouj
Abstract:
Neural architecture search (NAS) enables researchers to automatically explore vast search spaces and find efficient neural networks. But NAS suffers from a key bottleneck, i.e., numerous architectures need to be evaluated during the search process, which requires a lot of computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the evaluation time of neural architectures. However, they are not efficient enough and still only focus on the accuracy of architectures. In addition to the classification accuracy, more efficient and smaller network architectures are required in real-world applications. To address the above problems, we propose the SMEM-NAS, a pairwise comparison relation-assisted multi-objective evolutionary algorithm based on a multi-population mechanism. In the SMEM-NAS, a surrogate model is constructed based on pairwise comparison relations to predict the accuracy ranking of architectures, rather than the absolute accuracy. Moreover, two populations cooperate with each other in the search process, i.e., a main population guides the evolution, while a vice population expands the diversity. Our method aims to provide high-performance models that take into account multiple optimization objectives. We conduct a series of experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets to verify its effectiveness. With only a single GPU searching for 0.17 days, competitive architectures can be found by SMEM-NAS which achieves 78.91% accuracy with the MAdds of 570M on the ImageNet. This work makes a significant advance in the important field of NAS.
Submitted 22 July, 2024;
originally announced July 2024.
-
Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
Authors:
Aristeidis Tsaris,
Chengming Zhang,
Xiao Wang,
Junqi Yin,
Siyan Liu,
Moetasim Ashfaq,
Ming Fan,
Jong Youl Choi,
Mohamed Wahib,
Dan Lu,
Prasanna Balaprakash,
Feiyi Wang
Abstract:
Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these techniques for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, tensor parallelism, and flash attention strategies, to scale beyond single GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a transformer model on a full-attention matrix over 188K sequence length.
Submitted 17 April, 2024;
originally announced May 2024.
-
Adaptive Patching for High-resolution Image Segmentation with Transformers
Authors:
Enzhi Zhang,
Isaac Lyngaas,
Peng Chen,
Xiao Wang,
Jun Igarashi,
Yuankai Huo,
Mohamed Wahib,
Masaharu Munetomo
Abstract:
Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model, if we are to use smaller patch sizes that are favorable in segmentation. The solution is to either use custom complex multi-resolution models or approximate attention schemes. We take inspiration from Adaptive Mesh Refinement (AMR) methods in HPC by adaptively patching the images, as a pre-processing step, based on the image details to reduce the number of patches being fed to the model, by orders of magnitude. This method has a negligible overhead, and works seamlessly with any attention-based model, i.e. it is a pre-processing step that can be adopted by any attention-based model without friction. We demonstrate superior segmentation quality over SoTA segmentation models for real-world pathology datasets while gaining a geomean speedup of $6.9\times$ for resolutions up to $64K^2$, on up to $2,048$ GPUs.
Submitted 15 April, 2024;
originally announced April 2024.
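The adaptive patching abstract above describes refining patches only where the image has detail, in the spirit of AMR. A minimal quadtree-style sketch of that idea follows. The abstract does not state the paper's detail criterion, so the max-minus-min intensity range used here is a stand-in, and `detail` and `adaptive_patches` are hypothetical names. The sketch assumes a square image whose side is a power of two.

```python
def detail(image, r0, c0, size):
    """Stand-in detail measure: intensity range inside a square region."""
    vals = [image[r][c]
            for r in range(r0, r0 + size)
            for c in range(c0, c0 + size)]
    return max(vals) - min(vals)


def adaptive_patches(image, r0=0, c0=0, size=None, min_size=1, threshold=0):
    """Return (row, col, size) patches, subdividing only detailed regions."""
    if size is None:
        size = len(image)
    # Stop refining when the region is flat enough or already minimal.
    if size <= min_size or detail(image, r0, c0, size) <= threshold:
        return [(r0, c0, size)]
    half = size // 2
    patches = []
    for dr in (0, half):
        for dc in (0, half):
            patches += adaptive_patches(image, r0 + dr, c0 + dc, half,
                                        min_size, threshold)
    return patches


# A 4x4 image that is flat except for one pixel in the top-left corner.
image = [[0] * 4 for _ in range(4)]
image[0][0] = 1
print(len(adaptive_patches(image)))  # 7
```

Uniform patching at the finest size would produce 16 patches for this image. The adaptive scheme keeps the three flat quadrants as single coarse patches and refines only the quadrant containing the detail.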
-
CG-Kit: Code Generation Toolkit for Performant and Maintainable Variants of Source Code Applied to Flash-X Hydrodynamics Simulations
Authors:
Johann Rudi,
Youngjun Lee,
Aidan H. Chadha,
Mohamed Wahib,
Klaus Weide,
Jared P. O'Neal,
Anshu Dubey
Abstract:
CG-Kit is a new code generation toolkit that we propose as a solution for portability and maintainability for scientific computing applications. The development of CG-Kit is rooted in the urgent need created by the shifting landscape of high-performance computing platforms and the algorithmic complexities of a particular large-scale multiphysics application: Flash-X. This combination leads to unique challenges including handling an existing large code base in Fortran and/or C/C++, subdivision of code into a great variety of units supporting a wide range of physics and numerical methods, different parallelization techniques for distributed- and shared-memory systems and accelerator devices, and heterogeneity of computing platforms requiring coexisting variants of parallel algorithms. The challenges demand that developers determine custom abstractions and granularity for code generation. CG-Kit tackles this with standalone tools that can be combined into highly specific and, we argue, highly effective portability and maintainability tool chains. Here we present the design of our new tools: parametrized source trees, control flow graphs, and recipes. The tools are implemented in Python. Although the tools are agnostic to the programming language of the source code, we focus on C/C++ and Fortran. Code generation experiments demonstrate the generation of variants of parallel algorithms: first, multithreaded variants of the basic AXPY operation (scalar-vector multiplication and vector-vector addition) to introduce the application of CG-Kit tool chains; and second, variants of parallel algorithms within a hydrodynamics solver, called Spark, from Flash-X that operates on block-structured adaptive meshes. In summary, code generated by CG-Kit achieves a reduction by over 60% of the original C/C++/Fortran source code.
Submitted 6 January, 2024;
originally announced January 2024.
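For readers unfamiliar with the AXPY operation mentioned in the CG-Kit abstract, it computes a*x + y element-wise for a scalar a and vectors x and y. The sketch below shows a serial variant and a chunked, multithreaded variant side by side, which is the kind of pair of coexisting variants the abstract refers to. It is hand-written Python, not CG-Kit output, and the names `axpy` and `axpy_chunked` are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def axpy(a, x, y):
    """Serial variant: return a*x + y element-wise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def axpy_chunked(a, x, y, num_workers=4):
    """Threaded variant: split the index range into chunks, one per task."""
    n = len(x)
    chunk = max(1, (n + num_workers - 1) // num_workers)
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves chunk order, so results concatenate correctly.
        parts = list(pool.map(
            lambda b: axpy(a, x[b[0]:b[1]], y[b[0]:b[1]]), bounds))
    return [v for part in parts for v in part]


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(axpy(2.0, x, y))          # [12.0, 24.0, 36.0]
print(axpy_chunked(2.0, x, y))  # [12.0, 24.0, 36.0]
```

Both variants share the same arithmetic kernel and differ only in how the iteration space is divided, which is why generating such variants from one source is attractive for maintainability. In CPython the threaded variant mainly illustrates structure rather than speed.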
-
Ultra-Long Sequence Distributed Transformer
Authors:
Xiao Wang,
Isaac Lyngaas,
Aristeidis Tsaris,
Peng Chen,
Sajal Dash,
Mayanka Chandra Shekar,
Tao Luo,
Hong-Jun Yoon,
Mohamed Wahib,
John Gouley
Abstract:
Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long sequence training due to the overwhelming computation and memory requirements. Existing methods for long sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers with long sequences. It distributes a long sequence into segments among GPUs, with each GPU computing a partial self-attention for its segment. Then, it uses a fused communication and a novel double gradient averaging technique to avoid the need to aggregate partial self-attention and minimize communication overhead. We compared the performance of the LSS Transformer and the state-of-the-art Nvidia sequence parallelism on a Wikipedia enwik8 dataset. Results show that our proposed method leads to a 5.6x faster and 10.2x more memory-efficient implementation compared to state-of-the-art sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to an extreme sequence length of 50,112 at 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
Submitted 8 November, 2023; v1 submitted 4 November, 2023;
originally announced November 2023.
-
KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training
Authors:
Truong Thao Nguyen,
Balazs Gerofi,
Edgar Josafat Martinez-Noriega,
François Trahay,
Mohamed Wahib
Abstract:
This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading accuracy. We explore the convergence properties when accounting for the reduction in the number of SGD updates. Empirical results on various large-scale datasets and models used directly in image classification and segmentation show that while the with-replacement importance sampling algorithm performs poorly on large datasets, our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline. Code available at https://github.com/TruongThaoNguyen/kakurenbo
Submitted 16 October, 2023;
originally announced October 2023.
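The KAKURENBO abstract above describes choosing samples to hide in an epoch from their loss and prediction confidence. The sketch below shows one plausible selection rule under stated assumptions: take the lowest-loss samples up to a budget, then hide only those the model already predicts confidently. The exact rule, the function `select_hidden`, and the parameters `max_fraction` and `conf_threshold` are assumptions for illustration and are not taken from the paper or its repository.

```python
def select_hidden(losses, confidences, max_fraction=0.3, conf_threshold=0.7):
    """Return sorted indices of samples to skip in the next epoch.

    losses and confidences hold one value per training sample, as recorded
    during the previous epoch.
    """
    budget = int(len(losses) * max_fraction)
    # Lowest-loss samples contribute least to the gradient signal.
    by_loss = sorted(range(len(losses)), key=lambda i: losses[i])
    candidates = by_loss[:budget]
    # Keep a candidate in training if the model is not yet confident on it.
    return sorted(i for i in candidates if confidences[i] >= conf_threshold)


losses = [0.05, 2.0, 0.1, 0.02]
confidences = [0.99, 0.3, 0.6, 0.95]
print(select_hidden(losses, confidences, max_fraction=0.5))  # [0, 3]
```

Because the selection is recomputed every epoch from fresh statistics, a sample hidden once can return to training later if its loss rises, which is what makes the scheme adaptive rather than a one-time pruning of the dataset.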
-
Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)
Authors:
Lingqi Zhang,
Mohamed Wahib,
Peng Chen,
Jintao Meng,
Xiao Wang,
Toshio Endo,
Satoshi Matsuoka
Abstract:
General Purpose Graphics Processing Units (GPGPU) are used in most of the top systems in HPC. The total capacity of scratchpad memory has increased by more than 40 times in the last decade. However, existing optimizations for stencil computations using temporal blocking have not aggressively exploited the large capacity of scratchpad memory. This work uses the 2D Jacobian 5-point iterative stencil as a case study to investigate the use of large scratchpad memory. Unlike existing research that tiles the domain in a thread block fashion, we tile the domain so that each tile is large enough to utilize all available scratchpad memory on the GPU. Consequently, we process several time steps inside a single tile before offloading the result back to global memory. Our evaluation shows that our performance is comparable to state-of-the-art implementations, yet our implementation is much simpler and does not require auto-generation of code.
Submitted 5 June, 2023;
originally announced June 2023.
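The abstract above describes processing several time steps inside one tile before writing results back to global memory. The CPU-side sketch below illustrates that idea with overlapped tiling on a 2D 5-point stencil: each tile is loaded with a halo as wide as the number of fused time steps, advanced locally, and only its interior is written back. It is a simplified illustration and not the paper's GPU scratchpad implementation; the function names and the fixed-boundary treatment are assumptions.

```python
def jacobi_step(u):
    """One 5-point stencil sweep; edge cells are held fixed."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.2 * (u[i][j] + u[i - 1][j] + u[i + 1][j]
                               + u[i][j - 1] + u[i][j + 1])
    return new


def jacobi_naive(u, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def jacobi_temporal_blocked(u, steps, tile):
    """Advance `steps` time steps tile by tile using overlapped halos."""
    n, m = len(u), len(u[0])
    out = [row[:] for row in u]
    for r0 in range(0, n, tile):
        for c0 in range(0, m, tile):
            r1, c1 = min(r0 + tile, n), min(c0 + tile, m)
            # Load the tile plus a halo of width `steps`, clipped to the grid.
            hr0, hr1 = max(0, r0 - steps), min(n, r1 + steps)
            hc0, hc1 = max(0, c0 - steps), min(m, c1 + steps)
            local = [row[hc0:hc1] for row in u[hr0:hr1]]
            # All time steps run on the local copy (the "scratchpad").
            local = jacobi_naive(local, steps)
            # Halo cells may now be stale; write back only the tile interior.
            for i in range(r0, r1):
                out[i][c0:c1] = local[i - hr0][c0 - hc0:c1 - hc0]
    return out


grid = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
print(jacobi_temporal_blocked(grid, 3, 4) == jacobi_naive(grid, 3))  # True
```

The halo cells are recomputed redundantly by neighbouring tiles, which is the price of avoiding a write to global memory after every time step. Larger tiles reduce the relative cost of that redundancy, which is one reason large on-chip memory makes deeper temporal blocking attractive.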
-
Revisiting Temporal Blocking Stencil Optimizations
Authors:
Lingqi Zhang,
Mohamed Wahib,
Peng Chen,
Jintao Meng,
Xiao Wang,
Toshio Endo,
Satoshi Matsuoka
Abstract:
Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to $2.53$x and a geometric mean speedup of $1.49$x over the best state-of-the-art performance in each stencil benchmark.
Submitted 12 May, 2023;
originally announced May 2023.
-
Myths and Legends in High-Performance Computing
Authors:
Satoshi Matsuoka,
Jens Domke,
Mohamed Wahib,
Aleksandr Drozd,
Torsten Hoefler
Abstract:
In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore's law. While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research. Nevertheless, these myths are rarely based on scientific facts, but rather on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.
Submitted 24 October, 2023; v1 submitted 6 January, 2023;
originally announced January 2023.
-
Flash-X, a multiphysics simulation software instrument
Authors:
Anshu Dubey,
Klaus Weide,
Jared O'Neal,
Akash Dhruv,
Sean Couch,
J. Austin Harris,
Tom Klosterman,
Rajeev Jain,
Johann Rudi,
Bronson Messer,
Michael Pajkos,
Jared Carlson,
Ran Chu,
Mohamed Wahib,
Saurabh Chawdhary,
Paul M. Ricker,
Dongwook Lee,
Katie Antypas,
Katherine M. Riley,
Christopher Daley,
Murali Ganapathy,
Francis X. Timmes,
Dean M. Townsley,
Marcos Vanella,
John Bachan
, et al. (6 additional authors not shown)
Abstract:
Flash-X is a highly composable multiphysics software system that can be used to simulate physical phenomena in several scientific domains. It derives some of its solvers from FLASH, which was first released in 2000. Flash-X has a new framework that relies on abstractions and asynchronous communications for performance portability across a range of increasingly heterogeneous hardware platforms. Flash-X is meant primarily for solving Eulerian formulations of applications with compressible and/or incompressible reactive flows. It also has a built-in, versatile Lagrangian framework that can be used in many different ways, including implementing tracers, particle-in-cell simulations, and immersed boundary methods.
Submitted 24 August, 2022;
originally announced August 2022.
-
Image Gradient Decomposition for Parallel and Memory-Efficient Ptychographic Reconstruction
Authors:
Xiao Wang,
Aristeidis Tsaris,
Debangshu Mukherjee,
Mohamed Wahib,
Peng Chen,
Mark Oxley,
Olga Ovchinnikova,
Jacob Hinkle
Abstract:
Ptychography is a popular microscopic imaging modality for many scientific discoveries and sets the record for highest image resolution. Unfortunately, the high image resolution for ptychographic reconstruction requires a significant amount of memory and computation, forcing many applications to compromise their image resolution in exchange for a smaller memory footprint and a shorter reconstruction time. In this paper, we propose a novel image gradient decomposition method that significantly reduces the memory footprint for ptychographic reconstruction by tessellating image gradients and diffraction measurements into tiles. In addition, we propose a parallel image gradient decomposition method that enables asynchronous point-to-point communications and parallel pipelining with minimal overhead on a large number of GPUs. Our experiments on a Titanate material dataset (PbTiO3) with 16632 probe locations show that our Gradient Decomposition algorithm reduces memory footprint by 51 times. In addition, it achieves time-to-solution within 2.2 minutes by scaling to 4158 GPUs with a super-linear strong scaling efficiency of 364% compared to runtimes at 6 GPUs. This performance is 2.7 times more memory efficient, 9 times more scalable and 86 times faster than the state-of-the-art algorithm.
Submitted 16 December, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Preparing for the Future -- Rethinking Proxy Apps
Authors:
Satoshi Matsuoka,
Jens Domke,
Mohamed Wahib,
Aleksandr Drozd,
Ray Bair,
Andrew A. Chien,
Jeffrey S. Vetter,
John Shalf
Abstract:
A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.
Submitted 15 April, 2022;
originally announced April 2022.
-
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Authors:
Jens Domke,
Emil Vatai,
Balazs Gerofi,
Yuetsu Kodama,
Mohamed Wahib,
Artur Podobas,
Sparsh Mittal,
Miquel Pericàs,
Lingqi Zhang,
Peng Chen,
Aleksandr Drozd,
Satoshi Matsuoka
Abstract:
Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
Submitted 16 October, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications
Authors:
Lingqi Zhang,
Mohamed Wahib,
Peng Chen,
Jintao Meng,
Xiao Wang,
Toshio Endo,
Satoshi Matsuoka
Abstract:
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output in each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x in smaller SpMV datasets from SuiteSparse and $1.43$x in larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
Submitted 12 May, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems
Authors:
Steven Farrell,
Murali Emani,
Jacob Balma,
Lukas Drescher,
Aleksandr Drozd,
Andreas Fink,
Geoffrey Fox,
David Kanter,
Thorsten Kurth,
Peter Mattson,
Dawei Mu,
Amit Ruhela,
Kento Sato,
Koichi Shirahata,
Tsuguchika Tabaru,
Aristeidis Tsaris,
Jan Balewski,
Ben Cumming,
Takumi Danjo,
Jens Domke,
Takaaki Fukai,
Naoto Fukumoto,
Tatsuya Fukushi,
Balazs Gerofi,
Takumi Honda
, et al. (18 additional authors not shown)
Abstract:
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
Submitted 26 October, 2021; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Performance Portable Back-projection Algorithms on CPUs: Agnostic Data Locality and Vectorization Optimizations
Authors:
Peng Chen,
Mohamed Wahib,
Xiao Wang,
Shinichiro Takizawa,
Takahiro Hirofuchi,
Hirotaka Ogawa,
Satoshi Matsuoka
Abstract:
Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intense back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices, and also the emergence of CPUs comparable in performance to GPUs, back-projection for CPUs could become favorable. Unlike GPUs, extracting parallelism for back-projection algorithms on CPUs is complex given that parallelism and locality are not explicitly defined and controlled by the programmer, as is the case when using CUDA for instance. We propose a collection of novel back-projection algorithms that reduce the arithmetic computation, robustly enable vectorization, enforce a regular memory access pattern, and maximize the data locality. We also implement the novel algorithms as efficient back-projection kernels that are performance portable over a wide range of CPUs. Performance evaluation using a variety of CPUs from different vendors and generations demonstrates that our back-projection implementation achieves on average 5.2x speedup over the multi-threaded implementation of the most widely used, and optimized, open library. With a state-of-the-art CPU, we reach performance that rivals top-performing GPUs.
Submitted 27 April, 2021;
originally announced April 2021.
-
An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
Authors:
Albert Njoroge Kahira,
Truong Thao Nguyen,
Leonardo Bautista Gomez,
Ryousei Takano,
Rosa M Badia,
Mohamed Wahib
Abstract:
Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We leverage our model-driven analysis to be the basis for an oracle utility which can help in detecting the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.
Submitted 19 April, 2021;
originally announced April 2021.
-
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Authors:
Jens Domke,
Emil Vatai,
Aleksandr Drozd,
Peng Chen,
Yosuke Oyama,
Lingqi Zhang,
Shweta Salaria,
Daichi Mukunoki,
Artur Podobas,
Mohamed Wahib,
Satoshi Matsuoka
Abstract:
Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too.
Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.
Submitted 27 February, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
GTOPX Space Mission Benchmarks
Authors:
Martin Schlueter,
Mehdi Neshat,
Mohamed Wahib,
Masaharu Munetomo,
Markus Wagner
Abstract:
This contribution introduces the GTOPX space mission benchmark collection, which is an extension of the GTOP database published by the European Space Agency (ESA). GTOPX consists of ten individual benchmark instances representing real-world interplanetary space trajectory design problems. Compared with the original GTOP collection, GTOPX includes three new problem instances featuring mixed-integer and multi-objective properties. GTOPX offers simplified user handling, a unified benchmark function call, and some minor bug corrections to the original GTOP implementation. Furthermore, GTOPX is linked from its original C++ source code to Python and Matlab based on dynamic link libraries, assuring computationally fast and accurate reproduction of the benchmark results in all three programming languages. Space mission trajectory design problems such as those represented in GTOPX are known to be highly non-linear and difficult to solve. The GTOPX collection, therefore, aims particularly at researchers wishing to put advanced (meta)heuristic and hybrid optimization algorithms to the test. The goal of this paper is to provide researchers with a manual and reference to the newly available GTOPX benchmark software.
Submitted 17 February, 2021; v1 submitted 15 October, 2020;
originally announced October 2020.
-
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Authors:
Mohamed Wahib,
Haoyu Zhang,
Truong Thao Nguyen,
Aleksandr Drozd,
Jens Domke,
Lingqi Zhang,
Ryousei Takano,
Satoshi Matsuoka
Abstract:
The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
Submitted 26 August, 2020;
originally announced August 2020.
-
A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
Authors:
Fareed Qararyah,
Mohamed Wahib,
Doğa Dikbayır,
Mehmet Esat Belviranli,
Didem Unat
Abstract:
Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of a DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN. It requires no modification at either the model or the systems-level implementation of its operation kernels. ParDNN partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. ParDNN either outperforms or qualitatively improves upon the related work.
Submitted 5 May, 2021; v1 submitted 19 August, 2020;
originally announced August 2020.
-
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
Authors:
Lingqi Zhang,
Mohamed Wahib,
Haoyu Zhang,
Satoshi Matsuoka
Abstract:
GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of those synchronization methods. This work explores important undocumented features and provides an in-depth analysis of the performance considerations and pitfalls of the state-of-the-art synchronization methods for Nvidia GPUs. The provided analysis would be useful when making design choices for applications, libraries, and frameworks running on single and/or multi-GPU environments. We provide a case study of the commonly used reduction operator to illustrate how the knowledge gained in our analysis can be useful. We also describe our micro-benchmarks and measurement methods.
Submitted 11 April, 2020;
originally announced April 2020.
-
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Authors:
Kazuaki Matsumura,
Hamid Reza Zohouri,
Mohamed Wahib,
Toshio Endo,
Satoshi Matsuoka
Abstract:
Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
Submitted 3 February, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Authors:
Peng Chen,
Mohamed Wahib,
Shinichiro Takizawa,
Ryousei Takano,
Satoshi Matsuoka
Abstract:
Computed Tomography (CT) is a widely used technology that requires compute-intense algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection stages on CPUs and GPUs, respectively. Finally, we propose a distributed framework for high-resolution image reconstruction on state-of-the-art GPU-accelerated supercomputers. The framework relies on an elaborate interleave of MPI collective communication steps to achieve scalable communication. Evaluation on a single Tesla V100 GPU demonstrates that our back-projection kernel performs up to 1.6x faster than the standard FDK implementation. We also demonstrate the scalability and instantaneous CT capability of the distributed framework by using up to 2,048 V100 GPUs to solve 4K and 8K problems within 30 seconds and 2 minutes, respectively (including I/O).
Submitted 6 September, 2019;
originally announced September 2019.
-
A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
Authors:
Peng Chen,
Mohamed Wahib,
Shinichiro Takizawa,
Ryousei Takano,
Satoshi Matsuoka
Abstract:
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
Submitted 6 September, 2019; v1 submitted 13 July, 2019;
originally announced July 2019.
-
Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Authors:
Jens Domke,
Kazuaki Matsumura,
Mohamed Wahib,
Haoyu Zhang,
Keita Yashima,
Toshiki Tsuchikawa,
Yohei Tsuji,
Artur Podobas,
Satoshi Matsuoka
Abstract:
Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view.
In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL deviate architecturally at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic.
Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision hardware support with little-to-no performance implications. As Moore's law falters, our results partially reinforce the view taken by modern industry (e.g. upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units.
Submitted 25 March, 2019; v1 submitted 22 October, 2018;
originally announced October 2018.