-
MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing
Authors:
Nika Mansouri Ghiasi,
Mohammad Sadrosadati,
Harun Mustafa,
Arvid Gollwitzer,
Can Firtina,
Julien Eudine,
Haiyu Mao,
Joël Lindegger,
Meryem Banu Cavlak,
Mohammed Alser,
Jisung Park,
Onur Mutlu
Abstract:
Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storag…
▽ More
Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5$\times$-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
RowPress Vulnerability in Modern DRAM Chips
Authors:
Haocong Luo,
Ataberk Olgun,
A. Giray Yağlıkçı,
Yahya Can Tuğrul,
Steve Rhyner,
Meryem Banu Cavlak,
Joël Lindegger,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Memory isolation is a critical property for system reliability, security, and safety. We demonstrate RowPress, a DRAM read disturbance phenomenon different from the well-known RowHammer. RowPress induces bitflips by keeping a DRAM row open for a long period of time instead of repeatedly opening and closing the row. We experimentally characterize RowPress bitflips, showing their widespread existenc…
▽ More
Memory isolation is a critical property for system reliability, security, and safety. We demonstrate RowPress, a DRAM read disturbance phenomenon different from the well-known RowHammer. RowPress induces bitflips by keeping a DRAM row open for a long period of time instead of repeatedly opening and closing the row. We experimentally characterize RowPress bitflips, showing their widespread existence in commodity off-the-shelf DDR4 DRAM chips. We demonstrate RowPress bitflips in a real system that already has RowHammer protection, and propose effective mitigation techniques that protect DRAM against both RowHammer and RowPress.
△ Less
Submitted 19 August, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
MetaStore: High-Performance Metagenomic Analysis via In-Storage Computing
Authors:
Nika Mansouri Ghiasi,
Mohammad Sadrosadati,
Harun Mustafa,
Arvid Gollwitzer,
Can Firtina,
Julien Eudine,
Haiyu Ma,
Joël Lindegger,
Meryem Banu Cavlak,
Mohammed Alser,
Jisung Park,
Onur Mutlu
Abstract:
Metagenomics has led to significant advancements in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species' genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amo…
▽ More
Metagenomics has led to significant advancements in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species' genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing data movement overhead. However, designing an in-storage processing system for metagenomics is challenging because none of the existing approaches can be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MetaStore, the first in-storage processing system designed to significantly reduce the data movement overhead of end-to-end metagenomic analysis. MetaStore is enabled by our lightweight and cooperative design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) light-weight in-storage accelerators, and 5) data mapping. Our evaluation shows that MetaStore outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7-37.2$\times$ and 6.9-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MetaStore achieves 1.5-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated tool, while achieving significantly higher accuracy.
△ Less
Submitted 21 November, 2023;
originally announced November 2023.
-
MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation
Authors:
Arvid E. Gollwitzer,
Mohammed Alser,
Joel Bergtholdt,
Joel Lindegger,
Maximilian-David Rumpf,
Can Firtina,
Serghei Mangul,
Onur Mutlu
Abstract:
Metagenomics, the study of genome sequences of diverse organisms cohabiting in a shared environment, has experienced significant advancements across various medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to dete…
▽ More
Metagenomics, the study of genome sequences of diverse organisms cohabiting in a shared environment, has experienced significant advancements across various medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to determine the species present in a sample and their relative abundances. Currently, the field is dominated by either alignment-based tools, which offer high accuracy but are computationally expensive, or alignment-free tools, which are fast but lack the needed accuracy for many applications. In response to this dichotomy, we introduce MetaTrinity, a tool based on heuristics, to achieve a fundamental improvement in accuracy-runtime tradeoff over existing methods. We benchmark MetaTrinity against two leading metagenomic classifiers, each representing different ends of the performance-accuracy spectrum. On one end, Kraken2, a tool optimized for performance, shows modest accuracy yet a rapid runtime. The other end of the spectrum is governed by Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity achieves an accuracy comparable to Metalign while gaining a 4x speedup without any loss in accuracy. This directly equates to a fourfold improvement in runtime-accuracy tradeoff. Compared to Kraken2, MetaTrinity requires a 5x longer runtime yet delivers a 17x improvement in accuracy. This demonstrates a 3.4x enhancement in the accuracy-runtime tradeoff for MetaTrinity. This dual comparison positions MetaTrinity as a broadly applicable solution for metagenomic classification, combining advantages of both ends of the spectrum: speed and accuracy. MetaTrinity is publicly available at https://github.com/CMU-SAFARI/MetaTrinity.
△ Less
Submitted 23 January, 2025; v1 submitted 3 November, 2023;
originally announced November 2023.
-
SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences
Authors:
Maximilian-David Rumpf,
Mohammed Alser,
Arvid E. Gollwitzer,
Joel Lindegger,
Nour Almadhoun,
Can Firtina,
Serghei Mangul,
Onur Mutlu
Abstract:
Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Due to its high computational complexity, optimized exact and heuristic algorithms are still being develop…
▽ More
Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Due to its high computational complexity, optimized exact and heuristic algorithms are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis has been performed, potentially falsely discarding genetic sequences from further analysis and unnecessarily inflating computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about their advantages and downsides using thorough experimental evaluation and different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab.
△ Less
Submitted 23 January, 2025; v1 submitted 25 October, 2023;
originally announced October 2023.
-
An In-Memory Architecture for High-Performance Long-Read Pre-Alignment Filtering
Authors:
Taha Shahroodi,
Michael Miao,
Joel Lindegger,
Stephan Wong,
Onur Mutlu,
Said Hamdioui
Abstract:
With the recent move towards sequencing of accurate long reads, finding solutions that support efficient analysis of these reads becomes more necessary. The long execution time required for sequence alignment of long reads negatively affects genomic studies relying on sequence alignment. Although pre-alignment filtering as an extra step before alignment was recently introduced to mitigate sequence…
▽ More
With the recent move towards sequencing of accurate long reads, finding solutions that support efficient analysis of these reads becomes more necessary. The long execution time required for sequence alignment of long reads negatively affects genomic studies relying on sequence alignment. Although pre-alignment filtering as an extra step before alignment was recently introduced to mitigate sequence alignment for short reads, these filters do not work as efficiently for long reads. Moreover, even with efficient pre-alignment filters, the overall end-to-end (i.e., filtering + original alignment) execution time of alignment for long reads remains high, while the filtering step is now a major portion of the end-to-end execution time.
Our paper makes three contributions. First, it identifies data movement of sequences between memory units and computing units as the main source of inefficiency for pre-alignment filters of long reads. This is because although filters reject many of these long sequencing pairs before they get to the alignment stage, they still require a huge cost regarding time and energy consumption for the large data transferred between memory and processor. Second, this paper introduces an adaptation of a short-read pre-alignment filtering algorithm suitable for long reads. We call this LongGeneGuardian. Finally, it presents Filter-Fuse as an architecture that supports LongGeneGuardian inside the memory. FilterFuse exploits the Computation-In-Memory computing paradigm, eliminating the cost of data movement in LongGeneGuardian.
Our evaluations show that FilterFuse improves the execution time of filtering by 120.47x for long reads compared to State-of-the-Art (SoTA) filter, SneakySnake. FilterFuse also improves the end-to-end execution time of sequence alignment by up to 49.14x and 5207.63x compared to SneakySnake with SoTA aligner and only SoTA aligner, respectively.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Authors:
Taha Shahroodi,
Gagandeep Singh,
Mahdi Zahedi,
Haiyu Mao,
Joel Lindegger,
Can Firtina,
Stephan Wong,
Onur Mutlu,
Said Hamdioui
Abstract:
Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of…
▽ More
Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNNs. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations.
This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Sword-fish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.
△ Less
Submitted 26 November, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
RowPress: Amplifying Read Disturbance in Modern DRAM Chips
Authors:
Haocong Luo,
Ataberk Olgun,
A. Giray Yağlıkçı,
Yahya Can Tuğrul,
Steve Rhyner,
Meryem Banu Cavlak,
Joël Lindegger,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows.
This paper experimentally demonstrates and analyzes anoth…
▽ More
Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows.
This paper experimentally demonstrates and analyzes another widespread read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks memory isolation by keeping a DRAM row open for a long period of time, which disturbs physically nearby rows enough to cause bitflips. We show that RowPress amplifies DRAM's vulnerability to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip by one to two orders of magnitude under realistic conditions. In extreme cases, RowPress induces bitflips in a DRAM row when an adjacent row is activated only once. Our detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1) affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM technology scales down to smaller node sizes, and 3) affects a different set of DRAM cells from RowHammer and behaves differently from RowHammer as temperature and access pattern changes.
We demonstrate in a real DDR4-based system with RowHammer protection that 1) a user-level program induces bitflips by leveraging RowPress while conventional RowHammer cannot do so, and 2) a memory controller that adaptively keeps the DRAM row open for a longer period of time based on access pattern can facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we describe and evaluate a new methodology that adapts existing RowHammer mitigation techniques to also mitigate RowPress with low additional performance overhead. We open source all our code and data to facilitate future research on RowPress.
△ Less
Submitted 28 March, 2024; v1 submitted 29 June, 2023;
originally announced June 2023.
-
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Authors:
Meryem Banu Cavlak,
Gagandeep Singh,
Mohammed Alser,
Can Firtina,
Joël Lindegger,
Mohammad Sadrosadati,
Nika Mansouri Ghiasi,
Can Alkan,
Onur Mutlu
Abstract:
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for…
▽ More
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do no match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall 1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) recall in keeping on-target reads, 2) maintains high accuracy in downstream analysis, and 3) achieves better runtime performance, throughput, recall, precision, and generality compared to prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
△ Less
Submitted 23 October, 2024; v1 submitted 9 December, 2022;
originally announced December 2022.
-
Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources
Authors:
Sina Darabi,
Mohammad Sadrosadati,
Joël Lindegger,
Negar Akbarzadeh,
Mohammad Hosseini,
Jisung Park,
Juan Gómez-Luna,
Hamid Sarbazi-Azad,
Onur Mutlu
Abstract:
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel applications. In many GPU applications, GPU memory bandwidth bottlenecks performance, causing underutilization of GPU cores. Hence, disabling many cores does not affect the performance of memory-bound workloads. While simply power-gating unused GPU cores would save energy, prior works attempt to better utilize GPU core…
▽ More
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel applications. In many GPU applications, GPU memory bandwidth bottlenecks performance, causing underutilization of GPU cores. Hence, disabling many cores does not affect the performance of memory-bound workloads. While simply power-gating unused GPU cores would save energy, prior works attempt to better utilize GPU cores for other applications (ideally compute-bound), which increases the GPU's total throughput.
In this paper, we introduce Morpheus, a new hardware/software co-designed technique to boost the performance of memory-bound applications. The key idea of Morpheus is to exploit unused core resources to extend the GPU last level cache (LLC) capacity. In Morpheus, each GPU core has two execution modes: compute mode and cache mode. Cores in compute mode operate conventionally and run application threads. However, for the cores in cache mode, Morpheus invokes a software helper kernel that uses the cores' on-chip memories (i.e., register file, shared memory, and L1) in a way that extends the LLC capacity for a running memory-bound workload. Morpheus adds a controller to the GPU hardware to forward LLC requests to either the conventional LLC (managed by hardware) or the extended LLC (managed by the helper kernel). Our experimental results show that Morpheus improves the performance and energy efficiency of a baseline GPU architecture by an average of 39% and 58%, respectively, across several memory-bound workloads. Morpheus' performance is within 3% of a GPU design that has a quadruple-sized conventional LLC. Morpheus can thus contribute to reducing the hardware dedicated to a conventional LLC by exploiting idle cores' on-chip memory resources as additional cache capacity.
△ Less
Submitted 6 April, 2023; v1 submitted 22 September, 2022;
originally announced September 2022.
-
Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs
Authors:
Joël Lindegger,
Damla Senol Cali,
Mohammed Alser,
Juan Gómez-Luna,
Nika Mansouri Ghiasi,
Onur Mutlu
Abstract:
Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large mem…
▽ More
Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work. We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1x, 1.7x, and 2.1x speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0x 80.4x, 6.8x, 12.6x and 5.9x speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6x less chip area and 2.1x less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge. Availability: https://github.com/CMU-SAFARI/Scrooge
△ Less
Submitted 12 April, 2023; v1 submitted 21 August, 2022;
originally announced August 2022.
-
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Authors:
Can Firtina,
Kamlesh Pillai,
Gurpreet S. Kalsi,
Bharathwaj Suresh,
Damla Senol Cali,
Jeremie Kim,
Taha Shahroodi,
Meryem Banu Cavlak,
Joel Lindegger,
Mohammed Alser,
Juan Gómez Luna,
Sreenivas Subramoney,
Onur Mutlu
Abstract:
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highl…
▽ More
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. We identify an urgent need for a flexible, high-performance, and energy-efficient HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out negligible computations using a hardware-based filter, and 4) minimizing redundant computations.
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and 27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x - 1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency by 64.24x - 115.46x, 1.75x, 1.96x.
△ Less
Submitted 21 October, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis
Authors:
Mohammed Alser,
Joel Lindegger,
Can Firtina,
Nour Almadhoun,
Haiyu Mao,
Gagandeep Singh,
Juan Gomez-Luna,
Onur Mutlu
Abstract:
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologie…
▽ More
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent. The analysis script and data used in our experimental evaluation are available at: https://github.com/CMU-SAFARI/Molecules2Variations
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping
Authors:
Damla Senol Cali,
Konstantinos Kanellopoulos,
Joel Lindegger,
Zülal Bingöl,
Gurpreet S. Kalsi,
Ziyi Zuo,
Can Firtina,
Meryem Banu Cavlak,
Jeremie Kim,
Nika Mansouri Ghiasi,
Gagandeep Singh,
Juan Gómez-Luna,
Nour Almadhoun Alserr,
Mohammed Alser,
Sreenivas Subramoney,
Can Alkan,
Saugata Ghose,
Onur Mutlu
Abstract:
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in…
▽ More
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available.
We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping.
To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator.
We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph and sequence-to-sequence mapping pipelines.
△ Less
Submitted 31 May, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm
Authors:
Joël Lindegger,
Damla Senol Cali,
Mohammed Alser,
Juan Gómez-Luna,
Onur Mutlu
Abstract:
We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24$\times$ and the number of memory accesses by 12$\times$. We efficiently parallelize the algorithm for GPUs, achieving a 4.1$\times$ speedup over a CPU implementation of the same algorithm, a…
▽ More
We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24$\times$ and the number of memory accesses by 12$\times$. We efficiently parallelize the algorithm for GPUs, achieving a 4.1$\times$ speedup over a CPU implementation of the same algorithm, a 62$\times$ speedup over minimap2's CPU-based KSW2 and a 7.2$\times$ speedup over the CPU-based Edlib for long reads.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.