-
Efficient vectorized evaluation of Gaussian AO integrals on modern central processing units
Authors:
Andrey Asadchev,
Edward F. Valeev
Abstract:
We report an implementation of the McMurchie-Davidson evaluation scheme for 1- and 2-particle Gaussian AO integrals designed for efficient execution on modern central processing units (CPUs) with Single Instruction Multiple Data (SIMD) instruction sets. As in our recent MD implementation for graphical processing units (GPUs) [J. Chem. Phys. 160, 244109 (2024)], variable-sized batches of shellsets of integrals are evaluated at a time. By optimizing for the floating point instruction throughput rather than minimizing the number of operations, this approach achieves up to 50% of the theoretical hardware peak FP64 performance for many common SIMD-equipped platforms (AVX2, AVX512, NEON), which translates to speedups of up to a factor of 30 over the state-of-the-art one-shellset-at-a-time implementation of Obara-Saika-type schemes in Libint for a variety of primitive and contracted integrals. As with our previous work, we rely on the standard C++ programming language -- such as the std::simd standard library feature to be included in the 2026 ISO C++ standard -- without any explicit code generation to keep the code base small and portable. The implementation is part of the open source LibintX library freely available at https://github.com/ValeevGroup/libintx.
Submitted 14 June, 2025;
originally announced June 2025.
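For orientation, the McMurchie-Davidson scheme named in this abstract expands a product of two Cartesian Gaussians in Hermite Gaussians. The sketch below shows only the textbook one-dimensional recurrence for the expansion coefficients and the resulting overlap of two unnormalized primitives. It is a scalar illustration, not the batched SIMD code of LibintX, and the function names `hermite_e` and `overlap_1d` are hypothetical.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def hermite_e(i, j, t, a, b, A, B):
    """Hermite expansion coefficient E_t^{ij} for two 1D primitive Gaussians.

    a, b are the exponents and A, B the centers (textbook recurrence;
    hypothetical helper, not a LibintX function).
    """
    p = a + b                  # total exponent
    mu = a * b / p             # reduced exponent
    P = (a * A + b * B) / p    # center of the product Gaussian
    if t < 0 or t > i + j:
        return 0.0
    if i == 0 and j == 0:
        # Only t == 0 reaches this branch: the Gaussian product prefactor.
        return math.exp(-mu * (A - B) ** 2)
    if i > 0:
        # Build up the angular momentum on the first center.
        return (hermite_e(i - 1, j, t - 1, a, b, A, B) / (2 * p)
                + (P - A) * hermite_e(i - 1, j, t, a, b, A, B)
                + (t + 1) * hermite_e(i - 1, j, t + 1, a, b, A, B))
    # Build up the angular momentum on the second center.
    return (hermite_e(i, j - 1, t - 1, a, b, A, B) / (2 * p)
            + (P - B) * hermite_e(i, j - 1, t, a, b, A, B)
            + (t + 1) * hermite_e(i, j - 1, t + 1, a, b, A, B))


def overlap_1d(i, j, a, b, A, B):
    """1D overlap of (x-A)^i exp(-a(x-A)^2) with (x-B)^j exp(-b(x-B)^2)."""
    return hermite_e(i, j, 0, a, b, A, B) * math.sqrt(math.pi / (a + b))


# Example: overlap of two d- and p-type 1D factors on different centers.
example_overlap = overlap_1d(2, 1, 0.8, 1.3, 0.2, -0.5)
```

A throughput-oriented implementation of the kind the abstract describes would evaluate such recurrences for many shellsets at once; this scalar form is only meant to make the recurrence concrete.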
-
3-center and 4-center 2-particle Gaussian AO integrals on modern accelerated processors
Authors:
Andrey Asadchev,
Edward F. Valeev
Abstract:
We report an implementation of the McMurchie-Davidson (MD) algorithm for 3-center and 4-center 2-particle integrals over Gaussian atomic orbitals (AOs) with low and high angular momenta $l$ and varying degrees of contraction for graphical processing units (GPUs). This work builds upon our recent implementation of a matrix form of the MD algorithm that is efficient for GPU evaluation of 4-center 2-particle integrals over Gaussian AOs of high angular momenta ($l\geq 4$) [$\mathit{J. Phys. Chem. A}\ \mathbf{127}$, 10889 (2023)]. The use of unconventional data layouts and three variants of the MD algorithm allows integrals to be evaluated in double precision with sustained performance between 25% and 70% of the theoretical hardware peak. Performance assessment includes integrals over AOs with $l\leq 6$ (higher $l$ is supported). A preliminary implementation of the Hartree-Fock exchange operator is presented and assessed for computations with up to quadruple-zeta basis and more than 20,000 AOs. The corresponding C++ code is a part of the experimental open-source $\mathtt{LibintX}$ library available at $\mathbf{github.com:ValeevGroup/LibintX}$.
Submitted 30 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
CoNST: Code Generator for Sparse Tensor Networks
Authors:
Saurabh Raje,
Yufan Xu,
Atanas Rountev,
Edward F. Valeev,
Saday Sadayappan
Abstract:
Sparse tensor networks are commonly used to represent contractions over sparse tensors. Tensor contractions are higher-order analogs of matrix multiplication. Tensor networks arise commonly in many domains of scientific computing and data science. After a transformation into a tree of binary contractions, the network is implemented as a sequence of individual contractions. Several critical aspects must be considered in the generation of efficient code for a contraction tree, including sparse tensor layout mode order, loop fusion to reduce intermediate tensors, and the interdependence of loop order, mode order, and contraction order. We propose CoNST, a novel approach that considers these factors in an integrated manner using a single formulation. Our approach creates a constraint system that encodes these decisions and their interdependence, while aiming to produce reduced-order intermediate tensors via fusion. The constraint system is solved by the Z3 SMT solver and the result is used to create the desired fused loop structure and tensor mode layouts for the entire contraction tree. This structure is lowered to the IR of the TACO compiler, which is then used to generate executable code. Our experimental evaluation demonstrates very significant (sometimes orders of magnitude) performance improvements over current state-of-the-art sparse tensor compiler/library alternatives.
Submitted 9 January, 2024;
originally announced January 2024.
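To make the notion of a contraction tree concrete, the sketch below evaluates a small sparse tensor network as a sequence of binary contractions, which is the unfused baseline that the abstract sets out to improve on. It does not reproduce CoNST itself: there is no constraint system, no Z3, no loop fusion, and no TACO lowering. The coordinate-dictionary storage and the `contract` helper are assumptions made for illustration.

```python
from collections import defaultdict


def contract(x, x_modes, y, y_modes, out_modes):
    """Binary contraction of two sparse tensors stored as {index tuple: value}.

    Modes are single-character labels; modes shared by x and y that are
    absent from out_modes are summed over (hypothetical helper).
    """
    shared = [m for m in x_modes if m in y_modes]
    # Group the nonzeros of y by the values of the shared modes.
    y_by_key = defaultdict(list)
    for y_idx, y_val in y.items():
        y_bound = dict(zip(y_modes, y_idx))
        y_by_key[tuple(y_bound[m] for m in shared)].append((y_bound, y_val))
    out = defaultdict(float)
    for x_idx, x_val in x.items():
        x_bound = dict(zip(x_modes, x_idx))
        key = tuple(x_bound[m] for m in shared)
        # Only nonzeros of y that match on the shared modes contribute.
        for y_bound, y_val in y_by_key.get(key, ()):
            bound = {**x_bound, **y_bound}
            out[tuple(bound[m] for m in out_modes)] += x_val * y_val
    return dict(out)


# Network R[i,l] = sum_{j,k} A[i,j] B[j,k] C[k,l], evaluated as the tree (A*B)*C.
A = {(0, 1): 2.0, (1, 0): 3.0}
B = {(1, 0): 4.0, (0, 2): 5.0}
C = {(0, 1): 1.0, (2, 0): 2.0}
AB = contract(A, "ij", B, "jk", "ik")   # intermediate tensor of order 2
R = contract(AB, "ik", C, "kl", "il")
```

Here the intermediate `AB` is materialized in full; the fusion described in the abstract aims to reduce the order of such intermediates.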
-
High-performance evaluation of high angular momentum 4-center Gaussian integrals on modern accelerated processors
Authors:
Andrey Asadchev,
Edward F. Valeev
Abstract:
We present a high-performance evaluation method for 4-center 2-particle integrals over Gaussian atomic orbitals with high angular momenta ($l\geq4$) and arbitrary contraction degrees on graphical processing units (GPUs) and other accelerators. The implementation uses the matrix form of McMurchie-Davidson recurrences. Evaluation of the 4-center integrals over four $l=6$ ($i$) Gaussian AOs in double precision (FP64) on an NVIDIA V100 GPU outperforms the reference implementation of the Obara-Saika recurrences (${\tt Libint}$) running on a single Intel Xeon core by more than a factor of 1000, easily exceeding the 73:1 ratio of the respective hardware peak FLOP rates while reaching almost 50% of the V100 peak. The approach can be extended to support AOs with even higher angular momenta; for lower angular momenta ($l\leq3$) additional improvements will be reported elsewhere. The implementation is part of an open-source ${\tt LibintX}$ library freely available at https://github.com/ValeevGroup/LibintX.
Submitted 19 December, 2023; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Memory-Efficient Recursive Evaluation of 3-Center Gaussian Integrals
Authors:
Andrey Asadchev,
Edward F. Valeev
Abstract:
To improve the efficiency of Gaussian integral evaluation on modern accelerated architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are optimized for the memory footprint. For the 3-center 2-particle integrals that are key for the evaluation of Coulomb and other 2-particle interactions in the density-fitting approximation, the use of multi-quantal recurrences (in which multiple quanta are created or transferred at once) is shown to produce significant memory savings. Other innovations include leveraging register memory for reduced memory footprint and direct compile-time generation of optimized kernels (instead of custom code generation) with compile-time features of modern C++/CUDA. Performance of conventional and CUDA-based implementations of the proposed schemes is illustrated both for individual batches of integrals involving Gaussians with low and high angular momenta (up to $L=6$) and contraction degrees, and for the density-fitting-based evaluation of the Coulomb potential. The computer implementation is available in the open-source LibintX library.
Submitted 16 January, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
Authors:
Justus A. Calvin,
Cannada A. Lewis,
Edward F. Valeev
Abstract:
A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).
Submitted 9 October, 2015; v1 submitted 1 September, 2015;
originally announced September 2015.
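The core of SUMMA is that the product is accumulated as a sum over iterations k, each multiplying the k-th block column of A by the k-th block row of B. The serial sketch below applies that loop to block-sparse matrices with nonuniform block sizes, skipping absent blocks. It leaves out everything the abstract is actually about (distributed memory, task-based composition, concurrent scheduling of iterations, rank-sparse blocks), and the dictionary-of-blocks layout and function names are assumptions for illustration.

```python
def block_matmul_add(c, a, b):
    """Accumulate the dense block product a @ b into c (lists of lists)."""
    for i, a_row in enumerate(a):
        for k, a_ik in enumerate(a_row):
            for j, b_kj in enumerate(b[k]):
                c[i][j] += a_ik * b_kj


def summa_block_sparse(A, B, n_block_k):
    """Serial SUMMA-style product of block-sparse matrices.

    A and B map (block row, block column) to a dense block; missing keys
    are zero blocks. n_block_k is the number of blocks along the
    contracted dimension (hypothetical layout, not the paper's data structure).
    """
    C = {}
    for k in range(n_block_k):
        # One SUMMA iteration: block column k of A times block row k of B.
        a_panel = {i: blk for (i, kk), blk in A.items() if kk == k}
        b_panel = {j: blk for (kk, j), blk in B.items() if kk == k}
        for i, a_blk in a_panel.items():
            for j, b_blk in b_panel.items():
                if (i, j) not in C:
                    C[(i, j)] = [[0.0] * len(b_blk[0]) for _ in a_blk]
                block_matmul_add(C[(i, j)], a_blk, b_blk)
    return C


# Nonuniform blocking: block 0 has size 2, block 1 has size 1.
A_blocks = {(0, 0): [[1, 2], [3, 4]], (1, 1): [[1]]}
B_blocks = {(0, 1): [[1], [1]], (1, 1): [[5]]}
C_blocks = summa_block_sparse(A_blocks, B_blocks, n_block_k=2)
```

Because the amount of work per iteration depends on which blocks are present, the iterations are uneven in cost, which is the load imbalance the task-based formulation is designed to tolerate.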
-
MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation
Authors:
Robert J. Harrison,
Gregory Beylkin,
Florian A. Bischoff,
Justus A. Calvin,
George I. Fann,
Jacob Fosso-Tande,
Diego Galindo,
Jeff R. Hammond,
Rebecca Hartman-Baker,
Judith C. Hill,
Jun Jia,
Jakob S. Kottmann,
M-J. Yvonne Ou,
Laura E. Ratcliff,
Matthew G. Reuter,
Adam C. Richie-Halford,
Nichols A. Romero,
Hideo Sekino,
William A. Shelton,
Bryan E. Sundahl,
W. Scott Thornton,
Edward F. Valeev,
Álvaro Vázquez-Mayagoitia,
Nicholas Vence,
Yukina Yokoi
Abstract:
MADNESS (multiresolution adaptive numerical environment for scientific simulation) is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
Submitted 5 July, 2015;
originally announced July 2015.
-
Task-Based Algorithm for Matrix Multiplication: A Step Towards Block-Sparse Tensor Computing
Authors:
Justus A. Calvin,
Edward F. Valeev
Abstract:
Distributed-memory matrix multiplication (MM) is a key element of algorithms in many domains (machine learning, quantum physics). Conventional algorithms for dense MM rely on regular/uniform data decomposition to ensure load balance. These traits conflict with the irregular structure (block-sparse or rank-sparse within blocks) that is increasingly relevant for fast methods in quantum physics. To deal with such irregular data we present a new MM algorithm based on the Scalable Universal Matrix Multiplication Algorithm (SUMMA). The novel features are: (1) multiple-issue scheduling of SUMMA iterations, and (2) fine-grained task-based formulation. The latter eliminates the need for explicit internodal synchronization; with multiple-iteration scheduling this tolerates load imbalance due to nonuniform matrix structure. For square MM with uniform and nonuniform block sizes (the latter simulates matrices with general irregular structure) we found excellent performance in weak and strong-scaling regimes, on commodity and high-end hardware.
Submitted 20 April, 2015;
originally announced April 2015.