-
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code
Authors:
Kazuaki Matsumura,
Simon Garcia De Gonzalo,
Antonio J. Peña
Abstract:
Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as equality saturation, allow for exhaustive term rewriting at various levels of inputs, thereby simplifying compiler design.
In this paper, we propose equality saturation to optimize sequential codes utilized in directive-based programming for GPUs. Our approach realizes less computation, less memory access, and high memory throughput simultaneously. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
Submitted 17 September, 2024; v1 submitted 22 June, 2023;
originally announced June 2023.
-
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code
Authors:
Kazuaki Matsumura,
Simon Garcia De Gonzalo,
Antonio J. Peña
Abstract:
Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by simply attaching annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs because the actual generated computation is hidden from the application developer.
This paper describes and implements a novel, flexible optimization technique that operates by inserting a code emulator phase at the tail end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information, thus allowing further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions, which are difficult for even advanced GPU programmers to use. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures.
Submitted 26 January, 2023;
originally announced January 2023.
-
Near-optimal stochastic MIMO signal detection with a mixture of t-distribution prior
Authors:
Junichiro Hagiwara,
Kazushi Matsumura,
Hiroki Asumi,
Yukiko Kasuga,
Toshihiko Nishimura,
Takanori Sato,
Yasutaka Ogawa,
Takeo Ohgane
Abstract:
Multiple-input multiple-output (MIMO) systems will play a crucial role in future wireless communication, but improving their signal detection performance to increase transmission efficiency remains a challenge. To address this issue, we propose extending the discrete signal detection problem in MIMO systems to a continuous one and applying the Hamiltonian Monte Carlo method, an efficient Markov chain Monte Carlo algorithm. In our previous studies, we have used a mixture of normal distributions for the prior distribution. In this study, we propose using a mixture of t-distributions, which further improves detection performance. Based on our theoretical analysis and computer simulations, the proposed method can achieve near-optimal signal detection with polynomial computational complexity. This high-performance and practical MIMO signal detection could contribute to the development of the 6th-generation mobile network.
Submitted 7 March, 2024; v1 submitted 9 January, 2023;
originally announced January 2023.
-
An Estimation Framework for Passerby Engagement Interacting with Social Robots
Authors:
Taichi Sakaguchi,
Yuki Okafuji,
Kohei Matsumura,
Jun Baba,
Junya Nakanishi
Abstract:
Social robots are expected to serve as a technology that supports human labor, and one application is as an advertising medium in public spaces. When social robots provide information, such as recommended shops, adaptive communication according to the user's state is desired. User engagement, which is also defined as the level of interest in the robot, is likely to play an important role in adaptive communication. Therefore, in this paper, we propose a new framework to estimate user engagement. The proposed method focuses on four open problems: multi-party interactions, the process of state change in engagement, the difficulty of annotating engagement, and the need for real-world interaction datasets. The accuracy of the proposed method for estimating engagement was evaluated using interaction duration. The results show that the interaction duration can be accurately estimated by considering the influence of the behaviors of other people; this also implies that the proposed model accurately estimates the level of engagement during interaction with the robot.
Submitted 6 June, 2022;
originally announced June 2022.
-
JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
Authors:
Kazuaki Matsumura,
Simon Garcia De Gonzalo,
Antonio J. Peña
Abstract:
Rapid developments in computing technology have paved the way for directive-based programming models to take a principal role in maintaining the software portability of performance-critical applications. Such models incur minimal engineering cost for enabling computational acceleration on multiple architectures, as programmers are only required to add meta-information to sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertion of directives by the programmer can lead to side effects that limit the compiler optimizations available, which can result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems and require expensive and time-consuming code adjustment by programmers.
This paper introduces JACC, an OpenACC runtime framework that enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually optimized applications can be distributed automatically while keeping the original code structure and parallelism. In some cases, we show nearly linear scaling of kernel execution on NVIDIA V100 GPUs. When multiple GPUs are used adaptively, the resulting performance improvements amortize the latency of GPU-to-GPU communications.
Submitted 27 April, 2022; v1 submitted 27 October, 2021;
originally announced October 2021.
-
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Authors:
Kazuaki Matsumura,
Hamid Reza Zohouri,
Mohamed Wahib,
Toshio Endo,
Satoshi Matsuoka
Abstract:
Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
Submitted 3 February, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Authors:
Jens Domke,
Kazuaki Matsumura,
Mohamed Wahib,
Haoyu Zhang,
Keita Yashima,
Toshiki Tsuchikawa,
Yohei Tsuji,
Artur Podobas,
Satoshi Matsuoka
Abstract:
Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view.
In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL deviate architecturally at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic.
Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision with little-to-no performance implications. With Moore's law beginning to fail, our results partially reinforce the view taken by modern industry (e.g. upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units.
Submitted 25 March, 2019; v1 submitted 22 October, 2018;
originally announced October 2018.
-
Acoustic Probing for Estimating the Storage Time and Firmness of Tomatoes and Mandarin Oranges
Authors:
Hidetomo Kataoka,
Takashi Ijiri,
Kohei Matsumura,
Jeremy White,
Akira Hirabayashi
Abstract:
This paper introduces an acoustic probing technique to estimate the storage time and firmness of fruits; we emit an acoustic signal toward the fruit from a small speaker and capture the reflected signal with a tiny microphone. We collect reflected signals for fruits with various storage times and firmness conditions, using them to train regressors for estimation. To evaluate the feasibility of our acoustic probing, we performed experiments; we prepared 162 tomatoes and 153 mandarin oranges, collected their reflected signals using our developed device, and measured their firmness with a fruit firmness tester, over a period of 35 days for tomatoes and 60 days for mandarin oranges. We performed cross-validation using this data set. The average estimation errors of storage time and firmness for tomatoes were 0.89 days and 9.47 g/mm². Those for mandarin oranges were 1.67 days and 15.67 g/mm². The estimation of storage time was sufficiently accurate for casual users to select fruits in their favorite condition at home. In the experiments, we tested four different acoustic probes and found that sweep signals provide highly accurate estimation results.
Submitted 30 April, 2019; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Research Activity Classification based on Time Series Bibliometrics
Authors:
Takahiro Kawamura,
Yasuhiro Yamashita,
Katsuji Matsumura
Abstract:
Bibliometrics such as the number of papers and times cited are often used to compare researchers based on specific criteria. The criteria, however, are different in each research domain and are set by empirical laws. Moreover, some argue that a simple sum of metric values works to the advantage of senior researchers. Therefore, this paper attempts to construct features from time series data of bibliometrics, and then classify the researchers according to the features. In detail, time series patterns are extracted from bibliographic data sets, and then a model to classify whether the researchers are "distinguished" or not is created by a machine learning technique. The experiments achieved an F-measure of 80.0% in the classification of 114 researchers in two research domains based on the data sets of Japan Science and Technology Agency and Elsevier's Scopus. In the future, we will conduct verification on a number of researchers in several domains, and then use the model to discover "distinguished" researchers who are not widely known.
Submitted 4 August, 2017;
originally announced August 2017.