-
Large Scale Distributed Linear Algebra With Tensor Processing Units
Authors:
Adam G. M. Lewis,
Jackson Beall,
Martin Ganahl,
Markus Hauru,
Shrestha Basu Mallick,
Guifre Vidal
Abstract:
We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXU)s dominate the runtime, yielding impressive scaling, performance, and raw size: operating in float32 precision, a full 2048-core pod of third generation TPUs can multiply two matrices with linear size $N = 2^{20} = 1\,048\,576$ in about 2 minutes. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present (i) QR decomposition; (ii) resolution of linear systems; and (iii) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
Submitted 16 December, 2021;
originally announced December 2021.
-
Quantum Computation with Machine-Learning-Controlled Quantum Stuff
Authors:
Lucien Hardy,
Adam G. M. Lewis
Abstract:
We describe how one may go about performing quantum computation with arbitrary "quantum stuff", as long as it has some basic physical properties. Imagine a long strip of stuff, equipped with regularly spaced wires to provide input settings and to read off outcomes. After showing how the corresponding map from settings to outcomes can be construed as a quantum circuit, we provide a machine learning algorithm to tomographically "learn" which settings implement the members of a universal gate set. At optimum, arbitrary quantum gates, and thus arbitrary quantum programs, can be implemented using the stuff.
Submitted 29 November, 2019;
originally announced November 2019.
-
Automatic generation of CUDA code performing tensor manipulations using C++ expression templates
Authors:
Adam G. M. Lewis,
Harald P. Pfeiffer
Abstract:
We present a C++ library, TLoops, which uses a hierarchy of expression templates to represent operations upon tensorial quantities in single lines of C++ code that resemble analytic equations. These expressions may be run as-is, but may also be used to emit equivalent low-level C or CUDA code, which either performs the operations more quickly on the CPU, or allows them to be rapidly ported to run on NVIDIA GPUs. We detail the expression template and C++-class hierarchy that represents the expressions and which makes automatic code-generation possible. We then present benchmarks of the expression-template code, the automatically generated C code, and the automatically generated CUDA code running on several generations of NVIDIA GPU.
Submitted 24 April, 2018;
originally announced April 2018.