Search | arXiv e-print repository

doi 10.1145/3581576.3581610

Wilson matrix kernel for lattice QCD on A64FX architecture

Authors: Issaku Kanamori, Keigo Nitadori, Hideo Matsufuru

Abstract: We study the implementation of the even-odd Wilson fermion matrix for lattice QCD simulations on the A64FX architecture. Efficient coding of the stencil operation is investigated for two-dimensional packing to SIMD vectors. We measure the sustained performance on the supercomputer Fugaku at RIKEN R-CCS and show the profiler result of our code, which may signal an unexpected source of slow-down in addition to the detailed efficiency of each part of the code.

Submitted 15 March, 2023; originally announced March 2023.

Comments: 10 pages, contribution to the International Workshop on Arm-based HPC: Practice and Experience (IWAHPCE-2023), held in conjunction with The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2023), Singapore, Feb 27 - March 2, 2023

arXiv:1804.08935 [pdf, ps, other]

doi 10.1093/pasj/psy062

Fortran interface layer of the framework for developing particle simulator FDPS

Authors: Daisuke Namekata, Masaki Iwasawa, Keigo Nitadori, Ataru Tanikawa, Takayuki Muranushi, Long Wang, Natsuki Hosono, Kentaro Nomura, Junichiro Makino

Abstract: Numerical simulations based on particle methods have been widely used in various fields including astrophysics. To date, simulation software has been developed by individual researchers or research groups in each field, with a huge amount of time and effort, even though the numerical algorithms used are very similar. To improve the situation, we have developed a framework, called FDPS, which enables researchers to easily develop massively parallel particle simulation codes for arbitrary particle methods. Until version 3.0, FDPS provided an API only for the C++ programming language. This limitation comes from the fact that FDPS is developed using the template feature in C++, which is essential to support arbitrary particle data types. However, there are many researchers who use Fortran to develop their codes. Thus, the previous versions of FDPS require such people to invest much time to learn C++. This is inefficient. To cope with this problem, we newly developed a Fortran interface layer in FDPS, which provides an API for Fortran. In order to support arbitrary particle data types in Fortran, we design the Fortran interface layer as follows. Based on a given derived data type in Fortran representing a particle, a Python script provided by us automatically generates a library that manipulates the C++ core part of FDPS. This library is seen as a Fortran module providing the FDPS API from the Fortran side and uses C programs internally to interoperate Fortran with C++. In this way, we have overcome several technical issues when emulating 'template' in Fortran.
By using the Fortran interface, users can develop all parts of their codes in Fortran. We show that the overhead of the Fortran interface part is sufficiently small and a code written in Fortran shows a performance practically identical to the one written in C++.

Submitted 25 April, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: 10 pages, 10 figures; accepted for publication in PASJ; a typo in author name is corrected

arXiv:1612.00530 [pdf, ps, other]

Implementation and evaluation of data-compression algorithms for irregular-grid iterative methods on the PEZY-SC processor

Authors: Naoki Yoshifuji, Ryo Sakamoto, Keigo Nitadori, Jun Makino

Abstract: Iterative methods on irregular grids have been used widely in all areas of computational science and engineering for solving partial differential equations with complex geometry. They provide the flexibility to express complex shapes with relatively low computational cost. However, the direction of the evolution of high-performance processors in the last two decades has caused serious degradation of the computational efficiency of iterative methods on irregular grids, because of relatively low memory bandwidth. Data compression can in principle reduce the necessary memory bandwidth of iterative methods and thus improve the efficiency. We have implemented several data compression algorithms on the PEZY-SC processor, using the matrix generated for the HPCG benchmark as an example. For the SpMV (Sparse Matrix-Vector multiplication) part of the HPCG benchmark, the best implementation without data compression achieved 11.6 Gflops/chip, close to the theoretical limit due to the memory bandwidth. Our implementation with data compression has achieved 32.4 Gflops. This is of course a rather extreme case, since the grid used in HPCG is geometrically regular and thus its compression efficiency is very high. However, in real applications, it is in many cases possible to give a large part of the grid regular geometry, in particular when the resolution is high. Note that we do not need to change the structure of the program, except for the addition of the data compression/decompression subroutines.
Thus, we believe that data compression will be a very useful way to improve the performance of many applications which rely on the use of irregular grids.

Submitted 1 December, 2016; originally announced December 2016.

Comments: Talk given at IA3 2016 Sixth Workshop on Irregular Applications: Architectures and Algorithms http://hpc.pnl.gov/IA3/IA3/Program.html

arXiv:1412.0659 [pdf, other]

doi 10.1109/SC.2014.10

24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs

Authors: Jeroen Bédorf, Evghenii Gaburov, Michiko S. Fujii, Keigo Nitadori, Tomoaki Ishiyama, Simon Portegies Zwart

Abstract: We have simulated, for the first time, the long term evolution of the Milky Way Galaxy using 51 billion particles on the Swiss Piz Daint supercomputer with our $N$-body gravitational tree-code Bonsai. Herein, we describe the scientific motivation and numerical algorithms. The Milky Way model was simulated for 6 billion years, during which the bar structure and spiral arms were fully formed. This improves upon previous simulations by using 1000 times more particles, and provides a wealth of new data that can be directly compared with observations. We also report the scalability on both the Swiss Piz Daint and the US ORNL Titan. On Piz Daint the parallel efficiency of Bonsai was above 95%. The highest performance was achieved with a 242 billion particle Milky Way model using 18600 GPUs on Titan, thereby reaching a sustained GPU and application performance of 33.49 Pflops and 24.77 Pflops respectively.

Submitted 1 December, 2014; originally announced December 2014.

Comments: 12 pages, 4 figures, Published in: 'Proceeding SC '14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis'. Gordon Bell Prize 2014 finalist

arXiv:1001.0773 [pdf, ps, other]

doi 10.1109/MC.2009.419

Simulating the universe on an intercontinental grid of supercomputers

Authors: Simon Portegies Zwart, Tomoaki Ishiyama, Derek Groen, Keigo Nitadori, Junichiro Makino, Cees de Laat, Stephen McMillan, Kei Hiraki, Stefan Harfst, Paola Grosso

Abstract: Understanding the universe is hampered by the elusiveness of its most common constituent, cold dark matter. Almost impossible to observe, dark matter can be studied effectively by means of simulation and there is probably no other research field where simulation has led to so much progress in the last decade. Cosmological N-body simulations are an essential tool for evolving density perturbations in the nonlinear regime. Simulating the formation of large-scale structures in the universe, however, is still a challenge due to the enormous dynamic range in spatial and temporal coordinates, and due to the enormous computer resources required. The dynamic range is generally dealt with by the hybridization of numerical techniques. We deal with the computational requirements by connecting two supercomputers via an optical network and making them operate as a single machine. This is challenging, if only for the fact that the supercomputers of our choice are separated by half the planet, as one is located in Amsterdam and the other is in Tokyo. The co-scheduling of the two computers and the 'gridification' of the code enable us to achieve a 90% efficiency for this distributed intercontinental supercomputer.

Submitted 5 January, 2010; originally announced January 2010.

Comments: Accepted for publication in IEEE Computer

Showing 1–5 of 5 results for author: Nitadori, K