-
Direct Low-Dose CT Image Reconstruction on GPU using Out-Of-Core: Precision and Quality Study
Authors:
M. Chillarón,
G. Quintana-Ortí,
V. Vidal,
G. Verdú
Abstract:
Algebraic methods applied to the reconstruction of Sparse-view Computed Tomography (CT) can provide both high image quality and a decrease in the dose received by patients, although with an increased reconstruction time since their computational costs are higher. In our work, we present a new algebraic implementation that obtains an exact solution to the system of linear equations that models the problem and is based on single-precision floating-point arithmetic. By applying Out-Of-Core (OOC) techniques, the dimensions of the system can be increased regardless of the main memory size, as long as there is enough secondary storage (disk). These techniques have made it possible to process images of 768 x 768 pixels. A comparative study of our method on a GPU using both single-precision and double-precision arithmetic has been carried out. The goal is to assess the single-precision implementation in terms of both time improvement and quality of the reconstructed images, to determine whether it is a viable option. Using single-precision arithmetic approximately halves the reconstruction time of the double-precision implementation, whereas the obtained images retain all internal structures despite having higher noise levels.
Submitted 11 December, 2024; v1 submitted 10 December, 2024;
originally announced December 2024.
-
Fast Algorithms and Implementations for Computing the Minimum Distance of Quantum Codes
Authors:
Fernando Hernando,
Gregorio Quintana-Ortí,
Markus Grassl
Abstract:
The distance of a stabilizer quantum code is a very important feature since it determines the number of errors that can be detected and corrected. We present three new fast algorithms and implementations for computing the symplectic distance of the associated classical code. Our new algorithms are based on the Brouwer-Zimmermann algorithm. Our experimental study shows that these new implementations are much faster than current state-of-the-art licensed implementations on single-core processors, multicore processors, and shared-memory multiprocessors. In the most computationally demanding cases, the performance gain in computational time can be larger than one order of magnitude. The experimental study also shows good scalability on shared-memory parallel architectures.
Submitted 20 August, 2024;
originally announced August 2024.
-
Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures
Authors:
Mónica Chillarón,
Gregorio Quintana-Ortí,
Vicente Vidal,
Per-Gunnar Martinsson
Abstract:
Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks to compute the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with the smallest norm in cases where the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices that are so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions that guarantee that both conditions of a least squares solution are met, regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm that is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, which operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory.
Submitted 5 August, 2024;
originally announced August 2024.
-
Efficient algorithms for computing a rank-revealing UTV factorization on parallel computing architectures
Authors:
N. Heavner,
F. D. Igual,
G. Quintana-Ortí,
P. G. Martinsson
Abstract:
The randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin the RSVD, the recently proposed algorithm "randUTV" computes a FULL factorization of a given matrix that provides low-rank approximations with near-optimal error. Because the bulk of randUTV is cast in terms of communication-efficient operations like matrix-matrix multiplication and unpivoted QR factorizations, it is faster than competing rank-revealing factorization methods like column pivoted QR in most high performance computational settings. In this article, optimized randUTV implementations are presented for both shared memory and distributed memory computing environments. For shared memory, randUTV is redesigned in terms of an "algorithm-by-blocks" that, together with a runtime task scheduler, eliminates bottlenecks from data synchronization points to achieve acceleration over the standard "blocked algorithm", which is based on a purely fork-join approach. The distributed memory implementation is based on the ScaLAPACK library. The performance of our new codes compares favorably with competing factorizations available on both shared memory and distributed memory architectures.
Submitted 12 April, 2021;
originally announced April 2021.
-
Computing rank-revealing factorizations of matrices stored out-of-core
Authors:
Nathan Heavner,
Per-Gunnar Martinsson,
Gregorio Quintana-Ortí
Abstract:
This paper describes efficient algorithms for computing rank-revealing factorizations of matrices that are too large to fit in RAM, and must instead be stored on slow external memory devices such as solid-state or spinning disk hard drives (out-of-core or out-of-memory). Traditional algorithms for computing rank-revealing factorizations, such as the column pivoted QR factorization, or techniques for computing a full singular value decomposition of a matrix, are very communication intensive. They are naturally expressed as a sequence of matrix-vector operations, which become prohibitively expensive when data is not available in main memory. Randomization allows these methods to be reformulated so that large contiguous blocks of the matrix can be processed in bulk. The paper describes two distinct methods. The first is a blocked version of column pivoted Householder QR, organized as a "left-looking" method to minimize the number of write operations (which are more expensive than read operations on a spinning disk drive). The second method results in a so-called UTV factorization, which expresses a matrix $A$ as $A = U T V^*$ where $U$ and $V$ are unitary, and $T$ is triangular. This method is organized as an algorithm-by-blocks, in which floating point operations overlap read and write operations. The second method incorporates power iterations, and is exceptionally good at revealing the numerical rank; it can often be used as a substitute for a full singular value decomposition. Numerical experiments demonstrate that the new algorithms are almost as fast when processing data stored on a hard drive as traditional algorithms are for data stored in main memory. To be precise, the computational time for fully factorizing an $n\times n$ matrix scales as $cn^{3}$, with a scaling constant $c$ that is only marginally larger when the matrix is stored out of core.
Submitted 4 March, 2020; v1 submitted 17 February, 2020;
originally announced February 2020.
-
Parallel Implementations for Computing the Minimum Distance of a Random Linear Code on Multicomputers
Authors:
Gregorio Quintana-Ortí,
Fernando Hernando,
Francisco D. Igual
Abstract:
The minimum distance of a linear code is a key concept in information theory. Therefore, the time required by its computation is very important to many problems in this area. In this paper, we introduce a family of implementations of the Brouwer-Zimmermann algorithm for distributed-memory architectures for computing the minimum distance of a random linear code over $\mathbb{F}_{2}$. Both current commercial and public-domain software only work on either unicore architectures or shared-memory architectures, which are limited in the number of cores/processors employed in the computation. Our implementations focus on distributed-memory architectures, thus being able to employ hundreds or even thousands of cores in the computation of the minimum distance. Our experimental results show that our implementations are much faster, even up to several orders of magnitude, than current implementations widely used nowadays.
Submitted 20 November, 2019;
originally announced November 2019.
-
Computed tomography medical image reconstruction on affordable equipment by using out-of-core techniques
Authors:
Mónica Chillarón,
Gregorio Quintana-Ortí,
Vicente Vidal,
Gumersindo Verdú
Abstract:
As Computed Tomography (CT) scans are an essential medical test, many techniques have been proposed to reconstruct high-quality images using a smaller amount of radiation. One approach is to employ algebraic factorization methods to reconstruct the images, using fewer views than the traditional analytical methods. However, their main drawback is the high computational cost and hence the time needed to obtain the images, which is critical in daily clinical practice. For this reason, faster methods for solving this problem are required. In this paper, we propose a new reconstruction method based on the QR factorization that is very efficient on affordable equipment (standard multicore processors and standard Solid-State Drives) by using out-of-core techniques. Combining affordable hardware with the new software, we can boost the performance of the reconstructions and implement a reliable and competitive method that reconstructs high-quality CT images quickly.
Submitted 27 June, 2019;
originally announced July 2019.
-
Fast Algorithms for the Computation of the Minimum Distance of a Random Linear Code
Authors:
Fernando Hernando,
Francisco D. Igual,
Gregorio Quintana-Ortí
Abstract:
The minimum distance of a code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is crucial to many problems in this area. In this paper, we present and evaluate a family of algorithms and implementations to compute the minimum distance of a random linear code over $\mathbb{F}_{2}$ that are faster than different current implementations. In addition to the basic sequential implementations, we present parallel and vectorized implementations that deliver high performance on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable performance improvements compared with state-of-the-art implementations widely used nowadays.
Submitted 30 January, 2017; v1 submitted 22 March, 2016;
originally announced March 2016.
-
Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions
Authors:
Edoardo Di Napoli,
Diego Fabregat-Traver,
Gregorio Quintana-Ortì,
Paolo Bientinesi
Abstract:
Mathematical operators whose transformation rules constitute the building blocks of a multi-linear algebra are widely used in physics and engineering applications where they are very often represented as tensors. In the last century, thanks to the advances in tensor calculus, it was possible to uncover new research fields and make remarkable progress in the existing ones, from electromagnetism to the dynamics of fluids and from the mechanics of rigid bodies to quantum mechanics of many atoms. By now, the formal mathematical and geometrical properties of tensors are well defined and understood; conversely, in the context of scientific and high-performance computing, many tensor-related problems are still open. In this paper, we address the problem of efficiently computing contractions between two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library. In particular, we establish precise conditions to determine if and when GEMM, the kernel for matrix products, can be used. Such conditions take into consideration both the nature of the operation and the storage scheme of the tensors, and induce a classification of the contractions into three groups. For each group, we provide a recipe to guide the users towards the most effective use of BLAS.
Submitted 8 July, 2013;
originally announced July 2013.