Showing 1–2 of 2 results for author: Kaczmarek, O

Search v0.5.6 released 2020-02-24

arXiv:1411.4439 [pdf, other]

physics.comp-ph cs.MS hep-lat

Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs

Authors: O. Kaczmarek, C. Schmidt, P. Steinbrecher, M. Wagner

Abstract: Lattice Quantum Chromodynamics simulations typically spend most of their runtime in inversions of the fermion matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance greater than 300 GFlop/s on both architectures. This more than doubles the performance of the inversions. We also give a short overview of the Knights Corner architecture, discuss some details of the implementation and the effort required to obtain the achieved performance.

Submitted 17 November, 2014; originally announced November 2014.

Comments: 7 pages, proceedings, presented at 'GPU Computing in High Energy Physics', September 10-12, 2014, Pisa, Italy
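As a rough illustration of "inverting multiple vectors at the same time", the sketch below runs conjugate gradient on several right-hand sides together. It is a hypothetical toy, not the authors' code. The names `apply_matrix_multi` and `cg_multi` are invented here, and a periodic 1D Laplacian plus a mass term stands in for the fermion matrix. CG requires a symmetric (or Hermitian) positive-definite matrix, which the positive mass guarantees in this toy. Each system keeps its own CG scalars, but the matrix is applied to all search directions in a single sweep over the sites. That shared sweep is the structure a multi-vector kernel exploits; the toy itself makes no performance claim.

```python
"""Hypothetical sketch: conjugate gradient on several right-hand sides at once."""
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def apply_matrix_multi(vs: List[Vector], mass: float) -> List[Vector]:
    """Toy SPD operator A = (2 + mass^2) I - (left + right neighbour), periodic 1D.

    A single loop over the sites updates all vectors in vs."""
    n = len(vs[0])
    diag = 2.0 + mass * mass
    out = [[0.0] * n for _ in vs]
    for i in range(n):
        left, right = (i - 1) % n, (i + 1) % n
        for k, v in enumerate(vs):
            out[k][i] = diag * v[i] - v[left] - v[right]
    return out


def cg_multi(bs: List[Vector], mass: float, tol: float = 1e-10,
             max_iter: int = 1000) -> Tuple[List[Vector], int]:
    """Solve A x_j = b_j for every j; each system keeps its own alpha and beta."""
    n = len(bs[0])
    xs = [[0.0] * n for _ in bs]
    rs = [list(b) for b in bs]          # r = b - A x with the initial guess x = 0
    ps = [list(r) for r in rs]
    rr = [dot(r, r) for r in rs]
    stop = [tol * tol * dot(b, b) for b in bs]  # relative threshold on |r|^2
    for it in range(max_iter):
        if all(rr[j] <= stop[j] for j in range(len(bs))):
            return xs, it
        aps = apply_matrix_multi(ps, mass)  # one batched matrix application
        for j in range(len(bs)):
            if rr[j] <= stop[j]:
                continue  # this system has converged; leave it frozen
            alpha = rr[j] / dot(ps[j], aps[j])
            xs[j] = [x + alpha * p for x, p in zip(xs[j], ps[j])]
            rs[j] = [r - alpha * ap for r, ap in zip(rs[j], aps[j])]
            rr_new = dot(rs[j], rs[j])
            ps[j] = [r + (rr_new / rr[j]) * p for r, p in zip(rs[j], ps[j])]
            rr[j] = rr_new
    return xs, max_iter


if __name__ == "__main__":
    bs = [[1.0 if i == 0 else 0.0 for i in range(16)],
          [float(i % 3) for i in range(16)]]
    xs, iters = cg_multi(bs, mass=0.5)
    residual = max(abs(b_i - ax_i)
                   for b, ax in zip(bs, apply_matrix_multi(xs, 0.5))
                   for b_i, ax_i in zip(b, ax))
    print(f"iterations: {iters}, max residual: {residual:.2e}")
```

Converged systems are skipped in the scalar updates but still ride along in the batched matrix application, which only wastes a little work in this toy.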

arXiv:1409.1510 [pdf, other]

cs.DC hep-lat

HISQ inverter on Intel Xeon Phi and NVIDIA GPUs

Authors: O. Kaczmarek, C. Schmidt, P. Steinbrecher, Swagato Mukherjee, M. Wagner

Abstract: The runtime of a Lattice QCD simulation is dominated by a small kernel that multiplies a vector by a sparse matrix known as the "Dslash" operator. This kernel is therefore frequently optimized for various HPC architectures. In this contribution we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance of 250 GFlop/s on both architectures. This more than doubles the performance of the inversions. We give a short overview of both architectures, discuss some details of the implementation and the effort required to obtain the achieved performance.

Submitted 4 September, 2014; originally announced September 2014.

Comments: 7 pages, proceedings, presented at the 32nd International Symposium on Lattice Field Theory (Lattice 2014), June 23 to June 28, 2014, New York, USA
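To make the "sparse matrix" in this abstract concrete, here is a toy hopping operator in the spirit of Dslash. It is an assumption-laden sketch, not the HISQ operator: it lives on a periodic 1D lattice with U(1) phases as link variables, whereas the real kernel acts on SU(3) color vectors on a 4D lattice. The names `dslash_multi` and `inner` are hypothetical. The matrix is never stored; each row's two nonzeros are built on the fly from the links, and each link value is used for every vector in the batch. The demo checks that the toy operator is anti-Hermitian, a property staggered-type Dslash operators share, which is what typically lets CG be applied to the Hermitian positive-definite combination m² − D².

```python
"""Hypothetical toy: a 1D Dslash-like hopping operator with U(1) link phases."""
import cmath
import math
import random
from typing import List

CVector = List[complex]


def dslash_multi(links: List[complex], vs: List[CVector]) -> List[CVector]:
    """(D v)_x = 0.5 * (U_x v_{x+1} - conj(U_{x-1}) v_{x-1}) on a periodic lattice.

    D is never stored: each row has two nonzeros built on the fly from the
    links, and the link values fetched for site x serve every vector in vs."""
    n = len(links)
    out = [[0j] * n for _ in vs]
    for x in range(n):
        xp, xm = (x + 1) % n, (x - 1) % n
        fwd = links[x]               # U_x couples sites x and x+1
        bwd = links[xm].conjugate()  # conj(U_{x-1}) couples sites x and x-1
        for k, v in enumerate(vs):
            out[k][x] = 0.5 * (fwd * v[xp] - bwd * v[xm])
    return out


def inner(a: CVector, b: CVector) -> complex:
    """<a, b> = sum_x conj(a_x) * b_x."""
    return sum(p.conjugate() * q for p, q in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 8
    links = [cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n)]
    v = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    w = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    dv, dw = dslash_multi(links, [v, w])
    # Anti-Hermiticity: <w, D v> = -<D w, v>, so this should be close to zero.
    print(abs(inner(w, dv) + inner(dw, v)))
```

Checking anti-Hermiticity, and that batched and single-vector results agree, is a cheap sanity test when restructuring such a kernel to handle multiple vectors.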
