-
Fortran interface layer of the framework for developing particle simulator FDPS
Authors:
Daisuke Namekata,
Masaki Iwasawa,
Keigo Nitadori,
Ataru Tanikawa,
Takayuki Muranushi,
Long Wang,
Natsuki Hosono,
Kentaro Nomura,
Junichiro Makino
Abstract:
Numerical simulations based on particle methods have been widely used in various fields including astrophysics. To date, simulation software has been developed by individual researchers or research groups in each field, with a huge amount of time and effort, even though the numerical algorithms used are very similar. To improve the situation, we have developed a framework, called FDPS, which enables researchers to easily develop massively parallel particle simulation codes for arbitrary particle methods. Until version 3.0, FDPS provided an API only for the C++ programming language. This limitation comes from the fact that FDPS is developed using the template feature in C++, which is essential to support arbitrary data types of particles. However, there are many researchers who use Fortran to develop their codes. Thus, the previous versions of FDPS require such people to invest much time to learn C++. This is inefficient. To cope with this problem, we newly developed a Fortran interface layer in FDPS, which provides an API for Fortran. In order to support arbitrary data types of particles in Fortran, we design the Fortran interface layer as follows. Based on a given derived data type in Fortran representing a particle, a Python script provided by us automatically generates a library that manipulates the C++ core part of FDPS. This library is seen as a Fortran module providing the API of FDPS from the Fortran side and uses C programs internally to interoperate Fortran with C++. In this way, we have overcome several technical issues when emulating `template' in Fortran. By using the Fortran interface, users can develop all parts of their codes in Fortran. We show that the overhead of the Fortran interface part is sufficiently small and a code written in Fortran shows a performance practically identical to the one written in C++.
Submitted 25 April, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
A development of an accelerator board dedicated for multi-precision arithmetic operations and its application to Feynman loop integrals II
Authors:
H Daisaka,
N Nakasato,
T Ishikawa,
F Yuasa,
K Nitadori
Abstract:
Evaluating a wide variety of Feynman diagrams with multi-loop integrals and physical parameters, and comparing the results with high-energy experiments, is expected to help investigate new physics beyond the Standard Model. We have been developing a direct computation method for multi-loop integrals of Feynman diagrams. One feature of our method is that we adopt the double exponential rule for numerical integration, which enables us to evaluate loop integrals with boundary singularities. Another feature is that, in order to accelerate the numerical integrations with multi-precision calculations, we develop an accelerator system with Field Programmable Gate Array boards on which processing elements with dedicated logic for quadruple/hexuple/octuple precision arithmetic operations are implemented. In addition, we also develop a programming interface designed for easy use of the system. Development continues toward practical use of the system. We present the current state of development of our system, and the numerical results for higher-loop diagrams computed using our system.
Submitted 19 March, 2018;
originally announced March 2018.
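The double exponential rule named in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard tanh-sinh form of the rule on a finite interval, uses ordinary double precision rather than the multi-precision arithmetic the paper targets, and the function name `tanh_sinh` and its parameters `h` and `n` are hypothetical choices made only for this example.

```python
import math

def tanh_sinh(f, a, b, h=0.05, n=80):
    """Approximate the integral of f over (a, b) with the tanh-sinh rule.

    Substituting x = mid + half * tanh((pi/2) * sinh(t)) makes the
    transformed integrand decay double exponentially in t, so a plain
    trapezoidal sum in t copes with integrable endpoint singularities.
    The defaults keep |t| <= 4, which avoids overflow in exp and cosh.
    """
    half_width = 0.5 * (b - a)
    total = 0.0
    for k in range(-n, n + 1):
        t = k * h
        u = 0.5 * math.pi * math.sinh(t)
        # Same as mid + half * tanh(u), written so that nodes near the
        # lower endpoint keep their tiny offset instead of rounding to a.
        x = a + (b - a) / (1.0 + math.exp(-2.0 * u))
        if x <= a or x >= b:
            # The node rounded onto an endpoint; its weight is negligible.
            continue
        weight = 0.5 * math.pi * math.cosh(t) / math.cosh(u) ** 2
        total += f(x) * weight
    return h * half_width * total

# Two integrands that are singular at x = 0 (analytic values: 2 and -1).
inv_sqrt_integral = tanh_sinh(lambda x: 1.0 / math.sqrt(x), 0.0, 1.0)
log_integral = tanh_sinh(math.log, 0.0, 1.0)
```

The integrand is never evaluated exactly at an endpoint, which is why the rule suits the boundary singularities the abstract mentions.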
-
Implementation and performance of FDPS: A Framework Developing Parallel Particle Simulation Codes
Authors:
Masaki Iwasawa,
Ataru Tanikawa,
Natsuki Hosono,
Keigo Nitadori,
Takayuki Muranushi,
Junichiro Makino
Abstract:
We present the basic idea, implementation, measured performance and performance model of FDPS (Framework for developing particle simulators). FDPS is an application-development framework which helps the researchers to develop particle-based simulation programs for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, redistribution of particles, and gathering of particle information for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as Barnes-Hut tree method should be used for long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are necessary. FDPS provides all of these necessary functions for efficient parallel execution of particle-based simulations as "templates", which are independent of the actual data structure of particles and the functional form of the interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N^2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speedup was obtained for up to the full system of K computer. The minimum calculation time per timestep is in the range of 30 ms (N=10^7) to 300 ms (N=10^9). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
Submitted 24 April, 2016; v1 submitted 13 January, 2016;
originally announced January 2016.
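To make concrete what the abstract above means by a "simple, sequential and unoptimized program of O(N^2) calculation cost", here is a minimal direct-summation gravity kernel. It is only an illustration in Python and does not use the FDPS API; the function name `direct_accelerations`, the softening parameter `eps`, and the choice of units with G = 1 are assumptions made for this sketch.

```python
import math

def direct_accelerations(positions, masses, eps=0.0):
    """Return gravitational accelerations by direct O(N^2) summation.

    positions: list of [x, y, z]; masses: list of floats; G = 1 (assumed).
    eps is a Plummer softening length (hypothetical parameter).
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-interaction
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += masses[j] * inv_r3 * dx[k]
    return acc

# Two unit masses one length unit apart attract each other equally.
two_body = direct_accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
```

The double loop is the part a framework of this kind replaces with tree-based or neighbor-limited evaluation, together with domain decomposition and particle exchange.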
-
Particle mesh multipole method: An efficient solver for gravitational/electrostatic forces based on multipole method and fast convolution over a uniform mesh
Authors:
Keigo Nitadori
Abstract:
We propose an efficient algorithm for the evaluation of the potential and its gradient of gravitational/electrostatic $N$-body systems, which we call particle mesh multipole method (PMMM or PM$^3$). PMMM can be understood both as an extension of the particle mesh (PM) method and as an optimization of the fast multipole method (FMM). In the former viewpoint, the scalar density and potential held by a grid point are extended to multipole moments and local expansions in $(p+1)^2$ real numbers, where $p$ is the order of expansion. In the latter viewpoint, a hierarchical octree structure which brings its $\mathcal O(N)$ nature, is replaced with a uniform mesh structure, and we exploit the convolution theorem with fast Fourier transform (FFT) to speed up the calculations. Hence, independent $(p+1)^2$ FFTs with the size equal to the number of grid points are performed.
The fundamental idea is common to PPPM/MPE by Shimada et al. (1993) and FFTM by Ong et al. (2003). PMMM differs from them in supporting both the open and periodic boundary conditions, and employing an irreducible form where both the multipole moments and local expansions are expressed in $(p+1)^2$ real numbers and the transformation matrices in $(2p+1)^2$ real numbers.
The computational complexity is the larger of $\mathcal O(p^2 N)$ and $\mathcal O(N \log (N/p^2))$, and the memory demand is $\mathcal O(N)$ when the number of grid points is $\propto N/p^2$.
Submitted 17 October, 2014; v1 submitted 21 September, 2014;
originally announced September 2014.
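The counts quoted in the abstract above can be checked with a short worked calculation. This is a sketch under stated assumptions, not a derivation taken from the paper: it assumes a real-valued expansion that stores $2\ell+1$ real numbers per degree $\ell$, and that the transformation between moments and local expansions involves degrees up to $2p$. The symbol $M$ for the number of grid points is introduced here only for illustration.

```latex
% Real coefficients in an expansion truncated at order p
\sum_{\ell=0}^{p} (2\ell + 1) = (p+1)^2

% Same count for degrees up to 2p (transformation matrices)
\sum_{\ell=0}^{2p} (2\ell + 1) = (2p+1)^2

% (p+1)^2 independent FFTs on M grid points, with M \propto N/p^2
(p+1)^2 \, \mathcal{O}(M \log M)
  = \mathcal{O}\!\left( (p+1)^2 \, \frac{N}{p^2} \log\frac{N}{p^2} \right)
  = \mathcal{O}\!\left( N \log (N/p^2) \right)
```

Under these assumptions, the FFT stage accounts for the $\mathcal O(N \log (N/p^2))$ term in the stated complexity, while the per-particle work of order $p^2$ accounts for the $\mathcal O(p^2 N)$ term.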
-
4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem
Authors:
Tomoaki Ishiyama,
Keigo Nitadori,
Junichiro Makino
Abstract:
As an entry for the 2012 Gordon-Bell performance prize, we report performance results of astrophysical N-body simulations of one trillion particles performed on the full system of K computer. This is the first gravitational trillion-body simulation in the world. We describe the scientific motivation, the numerical algorithm, the parallelization strategy, and the performance analysis. Unlike many previous Gordon-Bell prize winners that used the tree algorithm for astrophysical N-body simulations, we used the hybrid TreePM method, which achieves a similar level of accuracy and in which the short-range force is calculated by the tree algorithm and the long-range force is solved by the particle-mesh algorithm. We developed a highly-tuned gravity kernel for short-range forces, and a novel communication algorithm for long-range forces. The average performances on 24576 and 82944 nodes of K computer are 1.53 and 4.45 Pflops, which correspond to 49% and 42% of the peak speed.
Submitted 13 April, 2015; v1 submitted 19 November, 2012;
originally announced November 2012.
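The peak-speed fractions quoted in the abstract above can be reproduced with simple arithmetic. The abstract does not state the per-node peak; the sketch below assumes 128 Gflops per node of the K computer (eight cores at 2 GHz, each completing eight double-precision operations per cycle), and the names `NODE_PEAK_GFLOPS` and `fraction_of_peak` are hypothetical.

```python
# Assumption (not stated in the abstract): 128 Gflops peak per K computer node.
NODE_PEAK_GFLOPS = 128.0

def fraction_of_peak(sustained_pflops, nodes):
    """Return sustained performance as a fraction of the aggregate peak."""
    peak_pflops = nodes * NODE_PEAK_GFLOPS / 1.0e6  # Gflops -> Pflops
    return sustained_pflops / peak_pflops

partial_system = fraction_of_peak(1.53, 24576)  # 24576-node run
full_system = fraction_of_peak(4.45, 82944)     # 82944-node run
```

Under that assumption the two fractions round to the 49% and 42% given in the abstract.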
-
Accelerating NBODY6 with Graphics Processing Units
Authors:
Keigo Nitadori,
Sverre J. Aarseth
Abstract:
We describe the use of Graphics Processing Units (GPUs) for speeding up the code NBODY6 which is widely used for direct $N$-body simulations. Over the years, the $N^2$ nature of the direct force calculation has proved a barrier for extending the particle number. Following an early introduction of force polynomials and individual time-steps, the calculation cost was first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers which speeded up the force calculation further, we are now in the era of GPUs where relatively small hardware systems are highly cost-effective. A significant gain in efficiency is achieved by employing the GPU to obtain the so-called regular force which typically involves some 99 percent of the particles, while the remaining local forces are evaluated on the host. However, the latter operation is performed up to 20 times more frequently and may still account for a significant cost. This effort is reduced by parallel SSE/AVX procedures where each interaction term is calculated using mainly single precision. We also discuss further strategies connected with coordinate and velocity prediction required by the integration scheme. This leaves hard binaries and multiple close encounters which are treated by several regularization methods. The present nbody6-GPU code is well balanced for simulations in the particle range $10^4-2 \times 10^5$ for a dual GPU system attached to a standard PC.
Submitted 6 May, 2012;
originally announced May 2012.
-
Phantom-GRAPE: numerical software library to accelerate collisionless $N$-body simulation with SIMD instruction set on x86 architecture
Authors:
Ataru Tanikawa,
Kohji Yoshikawa,
Keigo Nitadori,
Takashi Okamoto
Abstract:
(Abridged) We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, AVX, an enhanced version of SSE. In our library, not only Newtonian forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius r_cut (i.e. f(r)=0 at r>r_cut), can be quickly computed. Using an Intel Core i7-2600 processor, we measure the performance of our library for both kinds of force. In the case of Newtonian forces, we achieve 2 x 10^9 interactions per second with 1 processor core, which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With 4 processor cores, we obtain a performance of 8 x 10^9 interactions per second. In the case of the arbitrarily shaped forces, we can calculate 1 x 10^9 and 4 x 10^9 interactions per second with 1 and 4 processor cores, respectively. The performance with 1 processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions, respectively. These performances depend only weakly on the number of particles. This is in clear contrast with the fact that the performance of force calculations accelerated by GPUs depends strongly on the number of particles. Such a weak dependence of the performance on the number of particles is well suited to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. Collisionless N-body simulations accelerated with our library have a significant advantage over those accelerated by GPUs, especially in massively parallel environments.
Submitted 9 October, 2012; v1 submitted 19 March, 2012;
originally announced March 2012.
-
N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions
Authors:
Ataru Tanikawa,
Kohji Yoshikawa,
Takashi Okamoto,
Keigo Nitadori
Abstract:
We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme (Makino and Aarseth, 1992), and achieved the performance of 20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme (Nitadori et al., 2006a), and achieved 90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ~ 10^5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
Submitted 5 September, 2011; v1 submitted 14 April, 2011;
originally announced April 2011.
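For readers unfamiliar with the fourth-order Hermite scheme named in the abstract above, the following sketch shows its predictor and corrector for one coordinate of one particle. It uses the commonly quoted textbook form of the scheme, not the AVX code described in the paper; the function names and the scalar (one-dimensional) setting are assumptions made for brevity. Here `a` denotes acceleration and `j` its time derivative (jerk).

```python
def hermite_predict(x0, v0, a0, j0, dt):
    """Predict position and velocity with a Taylor series up to the jerk."""
    x_pred = x0 + v0 * dt + a0 * dt ** 2 / 2.0 + j0 * dt ** 3 / 6.0
    v_pred = v0 + a0 * dt + j0 * dt ** 2 / 2.0
    return x_pred, v_pred

def hermite_correct(x0, v0, a0, j0, a1, j1, dt):
    """Correct using acceleration and jerk evaluated at the predicted state."""
    v1 = v0 + (a0 + a1) * dt / 2.0 + (j0 - j1) * dt ** 2 / 12.0
    x1 = x0 + (v0 + v1) * dt / 2.0 + (a0 - a1) * dt ** 2 / 12.0
    return x1, v1

# Check on a prescribed motion x(t) = t^4, for which v = 4 t^3,
# a = 12 t^2 and j = 24 t; one step from t = 0 to t = 1.
x_pred, v_pred = hermite_predict(0.0, 1.0, 2.0, 6.0, 1.0)
x_new, v_new = hermite_correct(0.0, 0.0, 0.0, 0.0, 12.0, 24.0, 1.0)
```

In an N-body code, `a1` and `j1` come from a force evaluation at the predicted positions and velocities, which is the step the paper accelerates with SIMD instructions.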
-
Simulating the universe on an intercontinental grid of supercomputers
Authors:
Simon Portegies Zwart,
Tomoaki Ishiyama,
Derek Groen,
Keigo Nitadori,
Junichiro Makino,
Cees de Laat,
Stephen McMillan,
Kei Hiraki,
Stefan Harfst,
Paola Grosso
Abstract:
Understanding the universe is hampered by the elusiveness of its most common constituent, cold dark matter. Almost impossible to observe, dark matter can be studied effectively by means of simulation and there is probably no other research field where simulation has led to so much progress in the last decade. Cosmological N-body simulations are an essential tool for evolving density perturbations in the nonlinear regime. Simulating the formation of large-scale structures in the universe, however, is still a challenge due to the enormous dynamic range in spatial and temporal coordinates, and due to the enormous computer resources required. The dynamic range is generally dealt with by the hybridization of numerical techniques. We deal with the computational requirements by connecting two supercomputers via an optical network and make them operate as a single machine. This is challenging, if only for the fact that the supercomputers of our choice are separated by half the planet, as one is located in Amsterdam and the other is in Tokyo. The co-scheduling of the two computers and the 'gridification' of the code enables us to achieve a 90% efficiency for this distributed intercontinental supercomputer.
Submitted 5 January, 2010;
originally announced January 2010.