-
A Mixed Precision, Multi-GPU Design for Large-scale Top-K Sparse Eigenproblems
Authors:
Francesco Sgherzi,
Alberto Parravicini,
Marco Domenico Santambrogio
Abstract:
Graph analytics techniques based on spectral methods process extremely large sparse matrices with millions or even billions of non-zero values. Behind these algorithms lies the Top-K sparse eigenproblem, the computation of the largest eigenvalues and their associated eigenvectors. In this work, we leverage GPUs to scale the Top-K sparse eigenproblem to bigger matrices than previously achieved while also providing state-of-the-art execution times. We can transparently partition the computation across multiple GPUs, process out-of-core matrices, and tune precision and execution time using mixed-precision floating-point arithmetic. Overall, we are 67 times faster than the highly optimized ARPACK library running on a 104-thread CPU and 1.9 times faster than a recent FPGA hardware design. We also show that mixed-precision floating-point arithmetic improves execution time by 50% over double-precision arithmetic and is 12 times more accurate than single-precision floating-point arithmetic.
Submitted 19 January, 2022;
originally announced January 2022.
-
Demystifying Drug Repurposing Domain Comprehension with Knowledge Graph Embedding
Authors:
Edoardo Ramalli,
Alberto Parravicini,
Guido Walter Di Donato,
Mirko Salaris,
Céline Hudelot,
Marco Domenico Santambrogio
Abstract:
Drug repurposing is more relevant than ever due to drug development's rising costs and the need to respond to emerging diseases quickly. Knowledge graph embedding enables drug repurposing using heterogeneous data sources combined with state-of-the-art machine learning models to predict new drug-disease links in the knowledge graph. As in many machine learning applications, significant work is still required to understand the predictive models' behavior. We propose a structured methodology to better understand machine learning models' results for drug repurposing, suggesting key elements of the knowledge graph to improve predictions while saving computational resources. We reduce the training set by 11.05% and the embedding space by 31.87%, with only a 2% accuracy reduction, and increase accuracy by 60% on the open ogbl-biokg graph by adding only 1.53% new triples.
Submitted 30 August, 2021;
originally announced August 2021.
-
Solving Large Top-K Graph Eigenproblems with a Memory and Compute-optimized FPGA Design
Authors:
Francesco Sgherzi,
Alberto Parravicini,
Marco Siracusa,
Marco Domenico Santambrogio
Abstract:
Large-scale eigenvalue computations on sparse matrices are a key component of graph analytics techniques based on spectral methods. In such applications, an exhaustive computation of all eigenvalues and eigenvectors is impractical and unnecessary, as spectral methods can retrieve the relevant properties of enormous graphs using just the eigenvectors associated with the Top-K largest eigenvalues.
In this work, we propose a hardware-optimized algorithm to approximate a solution to the Top-K eigenproblem on sparse matrices representing large graph topologies. We prototype our algorithm through a custom FPGA hardware design that exploits HBM, Systolic Architectures, and mixed-precision arithmetic. We achieve a speedup of 6.22x compared to the highly optimized ARPACK library running on an 80-thread CPU, while keeping high accuracy and 49x better power efficiency.
Submitted 18 March, 2021;
originally announced March 2021.
-
Scaling up HBM Efficiency of Top-K SpMV for Approximate Embedding Similarity on FPGAs
Authors:
Alberto Parravicini,
Luca Giuseppe Cellamare,
Marco Siracusa,
Marco Domenico Santambrogio
Abstract:
Top-K SpMV is a key component of similarity-search on sparse embeddings. This sparse workload does not perform well on general-purpose NUMA systems that employ traditional caching strategies. Instead, modern FPGA accelerator cards have a few tricks up their sleeve. We introduce a Top-K SpMV FPGA design that leverages reduced precision and a novel packet-wise CSR matrix compression, enabling custom data layouts and delivering bandwidth efficiency often unreachable even in architectures with higher peak bandwidth. With HBM-based boards, we are 100x faster than a multi-threaded CPU implementation and 2x faster than a GPU with 20% higher bandwidth, with 14.2x higher power-efficiency.
Submitted 8 March, 2021;
originally announced March 2021.
-
DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime
Authors:
Alberto Parravicini,
Arnaud Delamare,
Marco Arnaboldi,
Marco D. Santambrogio
Abstract:
GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed down by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming. In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space-sharing, and transfer-computation overlap without requiring any advance information about the program dependency structure. We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task-parallelism and show an average of 44% speedup against synchronous execution, with no execution time slowdown compared to hand-optimized host code written using the C++ CUDA Graphs API.
Submitted 19 January, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
A reduced-precision streaming SpMV architecture for Personalized PageRank on FPGA
Authors:
Alberto Parravicini,
Francesco Sgherzi,
Marco D. Santambrogio
Abstract:
Sparse matrix-vector multiplication is employed in many data-analytic workloads in which low latency and high throughput are more valuable than exact numerical convergence. FPGAs provide quick execution times while offering precise control over the accuracy of the results thanks to reduced-precision fixed-point arithmetic. In this work, we propose a novel streaming implementation of Coordinate Format (COO) sparse matrix-vector multiplication, and study its effectiveness when applied to the Personalized PageRank algorithm, a common building block of recommender systems in e-commerce websites and social networks. Our implementation achieves speedups up to 6x over a reference floating-point FPGA architecture and a state-of-the-art multi-threaded CPU implementation on 8 different datasets, while preserving the numerical fidelity of the results and reaching up to 42x higher energy efficiency compared to the CPU implementation.
Submitted 22 September, 2020;
originally announced September 2020.
-
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Authors:
Alberto Zeni,
Giulia Guidi,
Marquita Ellis,
Nan Ding,
Marco D. Santambrogio,
Steven Hofmeyr,
Aydın Buluç,
Leonid Oliker,
Katherine Yelick
Abstract:
Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, which we named LOGAN. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6x. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near-optimal on the NVIDIA Tesla V100s.
Submitted 12 February, 2020;
originally announced February 2020.
-
A Framework For Identifying Group Behavior Of Wild Animals
Authors:
Guido Muscioni,
Riccardo Pressiani,
Matteo Foglio,
Margaret C. Crofoot,
Marco D. Santambrogio,
Tanya Berger-Wolf
Abstract:
Activity recognition and, more generally, behavior inference tasks are gaining a lot of interest. Much of this work is in the context of human behavior. Newly available tracking technologies for wild animals are generating datasets that may indirectly provide information about animal behavior. In this work, we propose a method for classifying these data into behavioral annotations, particularly the collective behavior of a social group. Our method is based on sequence analysis with a direct encoding of the interactions of a group of wild animals. We evaluate our approach on a real-world dataset, showing significant accuracy improvements over baseline methods.
Submitted 1 July, 2019;
originally announced July 2019.
-
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Authors:
Yunseong Lee,
Alberto Scolari,
Byung-Gon Chun,
Marco Domenico Santambrogio,
Markus Weimer,
Matteo Interlandi
Abstract:
Machine Learning models are often composed of pipelines of transformations. While this design allows efficient execution of single model components at training time, prediction serving has different requirements such as low latency, high throughput and graceful performance degradation under heavy load. Current prediction serving systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL improves performance along several dimensions: compared to state-of-the-art approaches, PRETZEL on average reduces 99th percentile latency by 5.5x, reduces memory footprint by 25x, and increases throughput by 4.7x.
Submitted 14 October, 2018;
originally announced October 2018.