-
MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
Authors:
Melina Soysal,
Konstantina Koliogeorgi,
Can Firtina,
Nika Mansouri Ghiasi,
Rakesh Nadig,
Haiyu Mayo,
Geraldo F. Oliveira,
Yu Liang,
Klea Zambaku,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly ac…
▽ More
Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly accelerate RSGA, the high volume of genomic data shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the main contributor to both runtime and energy consumption. Therefore, there is a need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities. We propose MARS, a storage-centric system that leverages the heterogeneous resources within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach. First, MARS modifies the RSGA pipeline via two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the RSGA steps directly within the storage by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms. Third, MARS orchestrates the execution of all steps to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93x and 40x, on average across different datasets, while reducing their energy consumption by 427x and 72x.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic
Authors:
Geraldo F. Oliveira,
Mayank Kabra,
Yuxin Guo,
Kangqi Chen,
A. Giray Yağlıkçı,
Melina Soysal,
Mohammad Sadrosadati,
Joaquin Olivares Bueno,
Saugata Ghose,
Juan Gómez-Luna,
Onur Mutlu
Abstract:
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary co…
▽ More
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations.
To address these issues, we propose Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus dynamically reduces the bit-precision (and thus the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones); (ii) Proteus concurrently executes independent in-DRAM primitives belonging to a single PUD operation across multiple DRAM arrays; (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer.
△ Less
Submitted 12 June, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis
Authors:
Ismail Emir Yuksel,
Yahya Can Tugrul,
F. Nisa Bostanci,
Geraldo F. Oliveira,
A. Giray Yaglikci,
Ataberk Olgun,
Melina Soysal,
Haocong Luo,
Juan Gómez-Luna,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
We experimentally analyze the computational capability of commercial off-the-shelf (COTS) DRAM chips and the robustness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of…
▽ More
We experimentally analyze the computational capability of commercial off-the-shelf (COTS) DRAM chips and the robustness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of 1) simultaneously activating up to 32 rows (i.e., simultaneous many-row activation), 2) executing a majority of X (MAJX) operation where X>3 (i.e., MAJ5, MAJ7, and MAJ9 operations), and 3) copying a DRAM row (concurrently) to up to 31 other DRAM rows, which we call Multi-RowCopy. Second, storing multiple copies of MAJX's input operands on all simultaneously activated rows drastically increases the success rate (i.e., the percentage of DRAM cells that correctly perform the computation) of the MAJX operation. For example, MAJ3 with 32-row activation (i.e., replicating each MAJ3's input operands 10 times) has a 30.81% higher average success rate than MAJ3 with 4-row activation (i.e., no replication). Third, data pattern affects the success rate of MAJX and Multi-RowCopy operations by 11.52% and 0.07% on average. Fourth, simultaneous many-row activation, MAJX, and Multi-RowCopy operations are highly resilient to temperature and voltage changes, with small success rate variations of at most 2.13% among all tested operations. We believe these empirical results demonstrate the promising potential of using DRAM as a computation substrate. To aid future research and development, we open-source our infrastructure at https://github.com/CMU-SAFARI/SiMRA-DRAM.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
PULSAR: Simultaneous Many-Row Activation for Reliable and High-Performance Computing in Off-the-Shelf DRAM Chips
Authors:
Ismail Emir Yuksel,
Yahya Can Tugrul,
F. Nisa Bostanci,
Abdullah Giray Yaglikci,
Ataberk Olgun,
Geraldo F. Oliveira,
Melina Soysal,
Haocong Luo,
Juan Gomez Luna,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where bulk-bitwise operations are performed leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1)…
▽ More
Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where bulk-bitwise operations are performed leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1) modern off-the-shelf DRAM chips do not officially support PuM operations, and 2) existing techniques of performing PuM operations on off-the-shelf DRAM chips suffer from two key limitations. First, these techniques have low success rates, i.e., only a small fraction of DRAM columns can correctly execute PuM operations because they operate beyond manufacturer-recommended timing constraints, causing these operations to be highly susceptible to noise and process variation. Second, these techniques have limited compute primitives, preventing them from fully leveraging parallelism across DRAM columns and thus hindering their performance benefits.
We propose PULSAR, a new technique to enable high-success-rate and high-performance PuM operations in off-the-shelf DRAM chips. PULSAR leverages our new observation that a carefully crafted sequence of DRAM commands simultaneously activates up to 32 DRAM rows. PULSAR overcomes the limitations of existing techniques by 1) replicating the input data to improve the success rate and 2) enabling new bulk bitwise operations (e.g., many-input majority, Multi-RowInit, and Bulk-Write) to improve the performance.
Our analysis on 120 off-the-shelf DDR4 chips from two major manufacturers shows that PULSAR achieves a 24.18% higher success rate and 121% higher performance over seven arithmetic-logic operations compared to FracDRAM, a state-of-the-art off-the-shelf DRAM-based PuM technique.
△ Less
Submitted 18 March, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations
Authors:
Maximilian Neumann,
Michael Bleher,
Lukas Hahn,
Samuel Braun,
Holger Obermaier,
Mehmet Soysal,
René Caspart,
Andreas Ott
Abstract:
Multi-parameter persistent homology naturally arises in applications of persistent topology to data that come with extra information depending on additional parameters, like for example time series data. We introduce the concept of a Vietoris-Rips transformation, a method that reduces the computation of the one-parameter persistent homology of pathwise subcomplexes in multi-filtered flag complexes…
▽ More
Multi-parameter persistent homology naturally arises in applications of persistent topology to data that come with extra information depending on additional parameters, like for example time series data. We introduce the concept of a Vietoris-Rips transformation, a method that reduces the computation of the one-parameter persistent homology of pathwise subcomplexes in multi-filtered flag complexes to the computation of the Vietoris-Rips persistent homology of certain semimetric spaces. The corresponding pathwise persistence barcodes track persistence features of the ambient multi-filtered complex and can in particular be used to recover the rank invariant in multi-parameter persistent homology. We present MuRiT, a scalable algorithm that computes the pathwise persistence barcodes of multi-filtered flag complexes by means of Vietoris-Rips transformations. Moreover, we provide an efficient software implementation of the MuRiT algorithm which resorts to Ripser for the actual computation of Vietoris-Rips persistence barcodes. To demonstrate the applicability of MuRiT to real-world datasets, we establish MuRiT as part of our CoVtRec pipeline for the surveillance of the convergent evolution of the coronavirus SARS-CoV-2 in the current COVID-19 pandemic.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.