M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures
Authors:
Srinath Kailasa,
Timo Betcke,
Sarah El Kazdadi
Abstract:
Hardware trends favor algorithm designs that maximize data reuse per FLOP. We develop and benchmark high-performance Multipole-to-Local (M2L) translation operators for the kernel-independent Fast Multipole Method (kiFMM), a widely adopted FMM variant that supports a broad class of kernels and has been favored by recent implementations for its simple specification. Naively implemented, M2L is bandwidth-limited and therefore a key bottleneck in the FMM. State-of-the-art FFT-based M2L implementations, though elegant and with a fast setup time, suffer from low operational intensity and require architecture-specific optimizations. We demonstrate that a BLAS-based M2L, combined with randomized low-rank compression, achieves competitive performance with greater portability and a simpler implementation leveraging existing BLAS infrastructure, at the cost of higher setup times, especially for high-accuracy settings in double precision. Our Rust-based implementation enables seamless switching between strategies for fair benchmarking. Results on CPUs show that FFT-based M2L is favorable in low-accuracy settings or dynamic particle simulations, while BLAS-based M2L is favored for high-accuracy settings for static particle distributions, where its higher setup costs are amortized in many practical applications of the FMM.
Submitted 28 May, 2025; v1 submitted 14 August, 2024;
originally announced August 2024.
PyExaFMM: an exercise in designing high-performance software with Python and Numba
Authors:
Srinath Kailasa,
Tingyu Wang,
Lorena A. Barba,
Timo Betcke
Abstract:
Numba is a game-changing compiler for high-performance computing with Python. It produces machine code that runs outside of the single-threaded Python interpreter and that fully utilizes the resources of modern CPUs. This means support for parallel multithreading and auto vectorization if available, as with compiled languages such as C++ or Fortran. In this article we document our experience developing PyExaFMM, a multithreaded Numba implementation of the Fast Multipole Method, an algorithm with a non-linear data structure and a large amount of data organization. We find that designing performant Numba code for complex algorithms can be as challenging as writing in a compiled language.
Submitted 13 April, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.