Search | arXiv e-print repository

arXiv:2308.01999 [pdf, other]

cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science

Authors: Harun Bayraktar, Ali Charara, David Clark, Saul Cohen, Timothy Costa, Yao-Lung L. Fang, Yang Gao, Jack Guan, John Gunnels, Azzam Haidar, Andreas Hehn, Markus Hohnerbach, Matthew Jones, Tom Lubowe, Dmitry Lyakh, Shinya Morino, Paul Springer, Sam Stanwyck, Igor Terentyev, Satya Varadhan, Jonathan Wong, Takuma Yamaguchi

Abstract: We present the NVIDIA cuQuantum SDK, a state-of-the-art library of composable primitives for GPU-accelerated quantum circuit simulations. As the size of quantum devices continues to increase, making their classical simulation progressively more difficult, the availability of fast and scalable quantum circuit simulators becomes vital for quantum algorithm developers, as well as quantum hardware eng… ▽ More We present the NVIDIA cuQuantum SDK, a state-of-the-art library of composable primitives for GPU-accelerated quantum circuit simulations. As the size of quantum devices continues to increase, making their classical simulation progressively more difficult, the availability of fast and scalable quantum circuit simulators becomes vital for quantum algorithm developers, as well as quantum hardware engineers focused on the validation and optimization of quantum devices. The cuQuantum SDK was created to accelerate and scale up quantum circuit simulators developed by the quantum information science community by enabling them to utilize efficient scalable software building blocks optimized for NVIDIA GPU platforms. The functional building blocks provided cover the needs of both state vector- and tensor network- based simulators, including approximate tensor network simulation methods based on matrix product state, projected entangled pair state, and other factorized tensor representations. By leveraging the enormous computing power of the latest NVIDIA GPU architectures, quantum circuit simulators that have adopted the cuQuantum SDK demonstrate significant acceleration, compared to CPU-only execution, for both the state vector and tensor network simulation methods. Furthermore, by utilizing the parallel primitives available in the cuQuantum SDK, one can easily transition to distributed GPU-accelerated platforms, including those furnished by cloud service providers and high-performance computing systems deployed by supercomputing centers, extending the scale of possible quantum circuit simulations. The rich capabilities provided by the SDK are conveniently made available via both Python and C application programming interfaces, where the former is directly targeting a broad Python quantum community and the latter allows tight integration with simulators written in any programming language. △ Less

Submitted 3 August, 2023; originally announced August 2023.

Comments: paper accepted at QCE 2023, journal reference will be updated whenever available

MSC Class: 68Q12; 68Q09; 81P68;

arXiv:2211.15460 [pdf]

Fragment-History Volumes

Authors: Francisco Inácio, Jan P. Springer

Abstract: Hardware-based triangle rasterization is still the prevalent method for generating images at real-time interactive frame rates. With the availability of a programmable graphics pipeline a large variety of techniques are supported for evaluating lighting and material properties of fragments. However, these techniques are usually restricted to evaluating local lighting and material effects. In addit… ▽ More Hardware-based triangle rasterization is still the prevalent method for generating images at real-time interactive frame rates. With the availability of a programmable graphics pipeline a large variety of techniques are supported for evaluating lighting and material properties of fragments. However, these techniques are usually restricted to evaluating local lighting and material effects. In addition, view-point changes require the complete processing of scene data to generate appropriate images. Reusing already rendered data in the frame buffer for a given view point by warping for a new viewpoint increases navigation fidelity at the expense of introducing artifacts for fragments previously hidden from the viewer. We present fragment-history volumes (FHV), a rendering technique based on a sparse, discretized representation of a 3d scene that emerges from recording all fragments that pass the rasterization stage in the graphics pipeline. These fragments are stored into per-pixel or per-octant lists for further processing; essentially creating an A-buffer. FHVs using per-octant fragment lists are view independent and allow fast resampling for image generation as well as for using more sophisticated approaches to evaluate material and lighting properties, eventually enabling global-illumination evaluation in the standard graphics pipeline available on current hardware. We show how FHVs are stored on the GPU in several ways, how they are created, and how they can be used for image generation at high rates. We discuss results for different usage scenarios, variations of the technique, and some limitations. △ Less

Submitted 28 November, 2022; originally announced November 2022.

ACM Class: I.3.7

arXiv:1805.10121 [pdf, other]

ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

Authors: Jan Winkelmann, Paul Springer, Edoardo Di Napoli

Abstract: Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting the… ▽ More Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures. △ Less

Submitted 25 May, 2018; originally announced May 2018.

Comments: 33 pages. Submitted to ACM TOMS

arXiv:1705.06661 [pdf, other]

Spin Summations: A High-Performance Perspective

Authors: Paul Springer, Devin Matthews, Paolo Bientinesi

Abstract: Besides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to hi… ▽ More Besides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm, achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant. △ Less

Submitted 18 May, 2017; originally announced May 2017.

ACM Class: G.4; D.1.3

arXiv:1704.04374 [pdf, other]

HPTT: A High-Performance Tensor Transposition C++ Library

Authors: Paul Springer, Tong Su, Paolo Bientinesi

Abstract: Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library… ▽ More Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design---inspired by BLIS---makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework---guided by a performance model---that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, Intel Knights Landing, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x. △ Less

Submitted 10 May, 2017; v1 submitted 14 April, 2017; originally announced April 2017.

ACM Class: G.4; D.1.3

arXiv:1701.05242 [pdf, other]

doi 10.1007/978-3-319-20119-1_12

A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics

Authors: Paul Springer, Ahmed E. Ismail, Paolo Bientinesi

Abstract: Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is cho- sen on a particle-by-particle basis, according to the… ▽ More Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is cho- sen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas- sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling. △ Less

Submitted 18 January, 2017; originally announced January 2017.

Comments: in ISC High Performance 2015

arXiv:1607.01249 [pdf, other]

doi 10.1145/2935323.2935328

TTC: A Tensor Transposition Compiler for Multiple Architectures

Authors: Paul Springer, Aravind Sankaran, Paolo Bientinesi

Abstract: We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), I… ▽ More We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to $8\times$ and $32\times$, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines. △ Less

Submitted 5 July, 2016; originally announced July 2016.

ACM Class: G.4; D.3.4; I.1.2; I.1.3

arXiv:1607.00145 [pdf, other]

Design of a high-performance GEMM-like Tensor-Tensor Multiplication

Authors: Paul Springer, Paolo Bientinesi

Abstract: We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly… ▽ More We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to $12.4\times$. More precisely, GETT achieves speedups of up to $1.41\times$ over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to $91.3\%$ of peak floating-point performance for compute-bound tensor contractions. △ Less

Submitted 7 November, 2017; v1 submitted 1 July, 2016; originally announced July 2016.

ACM Class: G.4; D.3.4; I.1.2; I.1.3

arXiv:1603.02297 [pdf, other]

TTC: A high-performance Compiler for Tensor Transpositions

Authors: Paul Springer, Jeff R. Hammond, Paolo Bientinesi

Abstract: We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark cov… ▽ More We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search. △ Less

Submitted 7 March, 2016; originally announced March 2016.

arXiv:1502.03234 [pdf, other]

A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Simulations of Interfacial Systems

Authors: Paul Springer

Abstract: This master thesis introduces the idea of dynamic cutoffs in molecular dynamics simulations, based on the distance between particles and the interface, and presents a solution for detecting interfaces in real-time. Our dynamic cutoff method (DCM) exhibits a linear-time complexity as well as nearly ideal weak and strong scaling. The DCM is tailored for massively parallel architectures and for large… ▽ More This master thesis introduces the idea of dynamic cutoffs in molecular dynamics simulations, based on the distance between particles and the interface, and presents a solution for detecting interfaces in real-time. Our dynamic cutoff method (DCM) exhibits a linear-time complexity as well as nearly ideal weak and strong scaling. The DCM is tailored for massively parallel architectures and for large interfacial systems with millions of particles. We implemented the DCM as part of the LAMMPS open-source molecular dynamics package and demonstrate the nearly ideal weak- and strong-scaling behavior of this method on an IBM BlueGene/Q supercomputer. Our results for a liquid/vapor system consisting of Lennard-Jones particles show that the accuracy of DCM is comparable to that of the traditional particle-particle particle- mesh (PPPM) algorithm. The performance comparison indicates that DCM is preferable for large systems due to the limited scaling of FFTs within the PPPM algorithm. Moreover, the DCM requires the interface to be identified every other MD timestep. As a consequence, this thesis also presents an interface detection method which is (1) applicable in real time; (2) parallelizable; and (3) scales linearly with respect to the number of particles. △ Less

Submitted 11 February, 2015; originally announced February 2015.

Showing 1–10 of 10 results for author: Springer, P