-
PuDHammer: Experimental Analysis of Read Disturbance Effects of Processing-using-DRAM in Real DRAM Chips
Authors:
Ismail Emir Yuksel,
Akash Sood,
Ataberk Olgun,
Oğuzhan Canpolat,
Haocong Luo,
F. Nisa Bostancı,
Mohammad Sadrosadati,
A. Giray Yağlıkçı,
Onur Mutlu
Abstract:
Processing-using-DRAM (PuD) is a promising paradigm for alleviating the data movement bottleneck using DRAM's massive internal parallelism and bandwidth to execute very wide operations. Performing a PuD operation involves activating multiple DRAM rows in quick succession or simultaneously, i.e., multiple-row activation. Multiple-row activation is fundamentally different from conventional memory ac…
▽ More
Processing-using-DRAM (PuD) is a promising paradigm for alleviating the data movement bottleneck using DRAM's massive internal parallelism and bandwidth to execute very wide operations. Performing a PuD operation involves activating multiple DRAM rows in quick succession or simultaneously, i.e., multiple-row activation. Multiple-row activation is fundamentally different from conventional memory access patterns that activate one DRAM row at a time. However, repeatedly activating even one DRAM row (e.g., RowHammer) can induce bitflips in unaccessed DRAM rows because modern DRAM is subject to read disturbance. Unfortunately, no prior work investigates the effects of multiple-row activation on DRAM read disturbance.
In this paper, we present the first characterization study of read disturbance effects of multiple-row activation-based PuD (which we call PuDHammer) using 316 real DDR4 DRAM chips from four major DRAM manufacturers. Our detailed characterization show that 1) PuDHammer significantly exacerbates the read disturbance vulnerability, causing up to 158.58x reduction in the minimum hammer count required to induce the first bitflip ($HC_{first}$), compared to RowHammer, 2) PuDHammer is affected by various operational conditions and parameters, 3) combining RowHammer with PuDHammer is more effective than using RowHammer alone to induce read disturbance error, e.g., doing so reduces $HC_{first}$ by 1.66x on average, and 4) PuDHammer bypasses an in-DRAM RowHammer mitigation mechanism (Target Row Refresh) and induces more bitflips than RowHammer.
To develop future robust PuD-enabled systems in the presence of PuDHammer, we 1) develop three countermeasures and 2) adapt and evaluate the state-of-the-art RowHammer mitigation standardized by industry, called Per Row Activation Counting (PRAC). We show that the adapted PRAC incurs large performance overheads (48.26%, on average).
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Understanding and Mitigating Side and Covert Channel Vulnerabilities Introduced by RowHammer Defenses
Authors:
F. Nisa Bostancı,
Oğuzhan Canpolat,
Ataberk Olgun,
İsmail Emir Yüksel,
Konstantinos Kanellopoulos,
Mohammad Sadrosadati,
A. Giray Yağlıkçı,
Onur Mutlu
Abstract:
DRAM chips are vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows. Attackers leverage RowHammer bitflips in real systems to take over systems and leak data. Consequently, many prior works propose mitigations, including recent DDR specifications introducing new mitigations (e.g., PRAC and RFM)…
▽ More
DRAM chips are vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows. Attackers leverage RowHammer bitflips in real systems to take over systems and leak data. Consequently, many prior works propose mitigations, including recent DDR specifications introducing new mitigations (e.g., PRAC and RFM). For robust operation, it is critical to analyze other security implications of RowHammer mitigations. Unfortunately, no prior work analyzes the timing covert and side channel vulnerabilities introduced by RowHammer mitigations.
This paper presents the first analysis and evaluation of timing covert and side channel vulnerabilities introduced by state-of-the-art RowHammer mitigations. We demonstrate that RowHammer mitigations' preventive actions have two fundamental features that enable timing channels. First, preventive actions reduce DRAM bandwidth availability, resulting in longer memory latencies. Second, preventive actions can be triggered on demand depending on memory access patterns.
We introduce LeakyHammer, a new class of attacks that leverage the RowHammer mitigation-induced memory latency differences to establish communication channels and leak secrets. First, we build two covert channel attacks exploiting two state-of-the-art RowHammer mitigations, achieving 38.6 Kbps and 48.6 Kbps channel capacity. Second, we demonstrate a website fingerprinting attack that identifies visited websites based on the RowHammer-preventive actions they cause. We propose and evaluate two countermeasures against LeakyHammer and show that fundamentally mitigating LeakyHammer induces large overheads in highly RowHammer-vulnerable systems. We believe and hope our work can enable and aid future work on designing robust systems against RowHammer mitigation-based side and covert channels.
△ Less
Submitted 22 April, 2025; v1 submitted 22 March, 2025;
originally announced March 2025.
-
Variable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance
Authors:
Ataberk Olgun,
F. Nisa Bostanci,
Ismail Emir Yuksel,
Oguzhan Canpolat,
Haocong Luo,
Geraldo F. Oliveira,
A. Giray Yaglikci,
Minesh Patel,
Onur Mutlu
Abstract:
Modern DRAM chips are subject to read disturbance errors. State-of-the-art read disturbance mitigations rely on accurate and exhaustive characterization of the read disturbance threshold (RDT) (e.g., the number of aggressor row activations needed to induce the first RowHammer or RowPress bitflip) of every DRAM row (of which there are millions or billions in a modern system) to prevent read disturb…
▽ More
Modern DRAM chips are subject to read disturbance errors. State-of-the-art read disturbance mitigations rely on accurate and exhaustive characterization of the read disturbance threshold (RDT) (e.g., the number of aggressor row activations needed to induce the first RowHammer or RowPress bitflip) of every DRAM row (of which there are millions or billions in a modern system) to prevent read disturbance bitflips securely and with low overhead. We experimentally demonstrate for the first time that the RDT of a DRAM row significantly and unpredictably changes over time. We call this new phenomenon variable read disturbance (VRD). Our experiments using 160 DDR4 chips and 4 HBM2 chips from three major manufacturers yield two key observations. First, it is very unlikely that relatively few RDT measurements can accurately identify the RDT of a DRAM row. The minimum RDT of a DRAM row appears after tens of thousands of measurements (e.g., up to 94,467), and the minimum RDT of a DRAM row is 3.5X smaller than the maximum RDT observed for that row. Second, the probability of accurately identifying a row's RDT with a relatively small number of measurements reduces with increasing chip density or smaller technology node size. Our empirical results have implications for the security guarantees of read disturbance mitigation techniques: if the RDT of a DRAM row is not identified accurately, these techniques can easily become insecure. We discuss and evaluate using a guardband for RDT and error-correcting codes for mitigating read disturbance bitflips in the presence of RDTs that change unpredictably over time. We conclude that a >10% guardband for the minimum observed RDT combined with SECDED or Chipkill-like SSC error-correcting codes could prevent read disturbance bitflips at the cost of large read disturbance mitigation performance overheads (e.g., 45% performance loss for an RDT guardband of 50%).
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis
Authors:
Ismail Emir Yuksel,
Yahya Can Tugrul,
F. Nisa Bostanci,
Geraldo F. Oliveira,
A. Giray Yaglikci,
Ataberk Olgun,
Melina Soysal,
Haocong Luo,
Juan Gómez-Luna,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
We experimentally analyze the computational capability of commercial off-the-shelf (COTS) DRAM chips and the robustness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of…
▽ More
We experimentally analyze the computational capability of commercial off-the-shelf (COTS) DRAM chips and the robustness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of 1) simultaneously activating up to 32 rows (i.e., simultaneous many-row activation), 2) executing a majority of X (MAJX) operation where X>3 (i.e., MAJ5, MAJ7, and MAJ9 operations), and 3) copying a DRAM row (concurrently) to up to 31 other DRAM rows, which we call Multi-RowCopy. Second, storing multiple copies of MAJX's input operands on all simultaneously activated rows drastically increases the success rate (i.e., the percentage of DRAM cells that correctly perform the computation) of the MAJX operation. For example, MAJ3 with 32-row activation (i.e., replicating each MAJ3's input operands 10 times) has a 30.81% higher average success rate than MAJ3 with 4-row activation (i.e., no replication). Third, data pattern affects the success rate of MAJX and Multi-RowCopy operations by 11.52% and 0.07% on average. Fourth, simultaneous many-row activation, MAJX, and Multi-RowCopy operations are highly resilient to temperature and voltage changes, with small success rate variations of at most 2.13% among all tested operations. We believe these empirical results demonstrate the promising potential of using DRAM as a computation substrate. To aid future research and development, we open-source our infrastructure at https://github.com/CMU-SAFARI/SiMRA-DRAM.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Revisiting Main Memory-Based Covert and Side Channel Attacks in the Context of Processing-in-Memory
Authors:
F. Nisa Bostanci,
Konstantinos Kanellopoulos,
Ataberk Olgun,
A. Giray Yaglikci,
Ismail Emir Yuksel,
Nika Mansouri Ghiasi,
Zulal Bingol,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
We introduce IMPACT, a set of high-throughput main memory-based timing attacks that leverage characteristics of processing-in-memory (PiM) architectures to establish covert and side channels. IMPACT enables high-throughput communication and private information leakage by exploiting the shared DRAM row buffer. To achieve high throughput, IMPACT (i) eliminates expensive cache bypassing steps require…
▽ More
We introduce IMPACT, a set of high-throughput main memory-based timing attacks that leverage characteristics of processing-in-memory (PiM) architectures to establish covert and side channels. IMPACT enables high-throughput communication and private information leakage by exploiting the shared DRAM row buffer. To achieve high throughput, IMPACT (i) eliminates expensive cache bypassing steps required by processor-centric memory-based timing attacks and (ii) leverages the intrinsic parallelism of PiM operations. We showcase two applications of IMPACT. First, we build two covert channels that leverage different PiM approaches (i.e., processing-near-memory and processing-using-memory) to establish high-throughput covert communication channels. Our covert channels achieve 8.2 Mb/s and 14.8 Mb/s communication throughput, respectively, which is 3.6x and 6.5x higher than the state-of-the-art main memory-based covert channel. Second, we showcase a side-channel attack that leaks private information of concurrently-running victim applications with a low error rate. Our source-code is openly and freely available at https://github.com/CMU-SAFARI/IMPACT.
△ Less
Submitted 12 June, 2025; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Virtuoso: Enabling Fast and Accurate Virtual Memory Research via an Imitation-based Operating System Simulation Methodology
Authors:
Konstantinos Kanellopoulos,
Konstantinos Sgouras,
F. Nisa Bostanci,
Andreas Kosmas Kakolyris,
Berkin Kerim Konar,
Rahul Bera,
Mohammad Sadrosadati,
Rakesh Kumar,
Nandita Vijaykumar,
Onur Mutlu
Abstract:
The unprecedented growth in data demand from emerging applications has turned virtual memory (VM) into a major performance bottleneck. Researchers explore new hardware/OS co-designs to optimize VM across diverse applications and systems. To evaluate such designs, researchers rely on various simulation methodologies to model VM components.Unfortunately, current simulation tools (i) either lack the…
▽ More
The unprecedented growth in data demand from emerging applications has turned virtual memory (VM) into a major performance bottleneck. Researchers explore new hardware/OS co-designs to optimize VM across diverse applications and systems. To evaluate such designs, researchers rely on various simulation methodologies to model VM components.Unfortunately, current simulation tools (i) either lack the desired accuracy in modeling VM's software components or (ii) are too slow and complex to prototype and evaluate schemes that span across the hardware/software boundary.
We introduce Virtuoso, a new simulation framework that enables quick and accurate prototyping and evaluation of the software and hardware components of the VM subsystem. The key idea of Virtuoso is to employ a lightweight userspace OS kernel, called MimicOS, that (i) accelerates simulation time by imitating only the desired kernel functionalities, (ii) facilitates the development of new OS routines that imitate real ones, using an accessible high-level programming interface, (iii) enables accurate and flexible evaluation of the application- and system-level implications of VM after integrating Virtuoso to a desired architectural simulator.
We integrate Virtuoso into five diverse architectural simulators, each specializing in different aspects of system design, and heavily enrich it with multiple state-of-the-art VM schemes. Our validation shows that Virtuoso ported on top of Sniper, a state-of-the-art microarchitectural simulator, models the memory management unit of a real high-end server-grade page fault latency of a real Linux kernel with high accuracy . Consequently, Virtuoso models the IPC performance of a real high-end server-grade CPU with 21% higher accuracy than the baseline version of Sniper. The source code of Virtuoso is freely available at https://github.com/CMU-SAFARI/Virtuoso.
△ Less
Submitted 26 March, 2025; v1 submitted 7 March, 2024;
originally announced March 2024.
-
MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing
Authors:
Geraldo F. Oliveira,
Ataberk Olgun,
Abdullah Giray Yağlıkçı,
F. Nisa Bostancı,
Juan Gómez-Luna,
Saugata Ghose,
Onur Mutlu
Abstract:
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, DRAM rows' large and rigid granularity limit the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD paralleli…
▽ More
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, DRAM rows' large and rigid granularity limit the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD parallelism, PUD execution often leads to underutilization, throughput loss, and energy waste. Second, most PUD architectures are limited to the execution of parallel map operations. Third, the need to feed the wide DRAM row with tens of thousands of data elements combined with the lack of adequate compiler support for PUD systems create a programmability barrier.
Our goal is to design a flexible PUD system that overcomes the limitations caused by the large and rigid granularity of PUD. To this end, we propose MIMDRAM, a hardware/software co-designed PUD system that introduces new mechanisms to allocate and control only the necessary resources for a given PUD operation. The key idea of MIMDRAM is to leverage fine-grained DRAM (i.e., the ability to independently access smaller segments of a large DRAM row) for PUD computation. MIMDRAM exploits this key idea to enable a multiple-instruction multiple-data (MIMD) execution model in each DRAM subarray.
We evaluate MIMDRAM using twelve real-world applications and 495 multi-programmed application mixes. Our evaluation shows that MIMDRAM provides 34x the performance, 14.3x the energy efficiency, 1.7x the throughput, and 1.3x the fairness of a state-of-the-art PUD framework, along with 30.6x and 6.8x the energy efficiency of a high-end CPU and GPU, respectively. MIMDRAM adds small area cost to a DRAM chip (1.11%) and CPU die (0.6%).
△ Less
Submitted 3 March, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
CoMeT: Count-Min-Sketch-based Row Tracking to Mitigate RowHammer at Low Cost
Authors:
F. Nisa Bostanci,
Ismail Emir Yuksel,
Ataberk Olgun,
Konstantinos Kanellopoulos,
Yahya Can Tugrul,
A. Giray Yaglikci,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
We propose a new RowHammer mitigation mechanism, CoMeT, that prevents RowHammer bitflips with low area, performance, and energy costs in DRAM-based systems at very low RowHammer thresholds. The key idea of CoMeT is to use low-cost and scalable hash-based counters to track DRAM row activations. CoMeT uses the Count-Min Sketch technique that maps each DRAM row to a group of counters, as uniquely as…
▽ More
We propose a new RowHammer mitigation mechanism, CoMeT, that prevents RowHammer bitflips with low area, performance, and energy costs in DRAM-based systems at very low RowHammer thresholds. The key idea of CoMeT is to use low-cost and scalable hash-based counters to track DRAM row activations. CoMeT uses the Count-Min Sketch technique that maps each DRAM row to a group of counters, as uniquely as possible, using multiple hash functions. When a DRAM row is activated, CoMeT increments the counters mapped to that DRAM row. Because the mapping from DRAM rows to counters is not completely unique, activating one row can increment one or more counters mapped to another row. Thus, CoMeT may overestimate, but never underestimates, a DRAM row's activation count. This property of CoMeT allows it to securely prevent RowHammer bitflips while properly configuring its hash functions reduces overestimations. As a result, CoMeT 1) implements substantially fewer counters than the number of DRAM rows in a DRAM bank and 2) does not significantly overestimate a DRAM row's activation count.
Our comprehensive evaluations show that CoMeT prevents RowHammer bitflips with an average performance overhead of only 4.01% across 61 benign single-core workloads for a very low RowHammer threshold of 125, normalized to a system with no RowHammer mitigation. CoMeT achieves a good trade-off between performance, energy, and area overheads. Compared to the best-performing state-of-the-art mitigation, CoMeT requires 74.2x less area overhead at the RowHammer threshold 125 and incurs a small performance overhead on average for all RowHammer thresholds. Compared to the best-performing low-area-cost mechanism, at a very low RowHammer threshold of 125, CoMeT improves performance by up to 39.1% while incurring a similar area overhead. CoMeT is openly and freely available at https://github.com/CMU-SAFARI/CoMeT.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis
Authors:
Ismail Emir Yuksel,
Yahya Can Tugrul,
Ataberk Olgun,
F. Nisa Bostanci,
A. Giray Yaglikci,
Geraldo F. Oliveira,
Haocong Luo,
Juan Gómez-Luna,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to reduce or eliminate costly data movement between processing elements and main memory. Prior works experimentally demonstrate three-input MAJ (MAJ3) and two-input AND and OR operations in commercial off-the-…
▽ More
Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to reduce or eliminate costly data movement between processing elements and main memory. Prior works experimentally demonstrate three-input MAJ (MAJ3) and two-input AND and OR operations in commercial off-the-shelf (COTS) DRAM chips. Yet, demonstrations on COTS DRAM chips do not provide a functionally complete set of operations.
We experimentally demonstrate that COTS DRAM chips are capable of performing 1) functionally-complete Boolean operations: NOT, NAND, and NOR and 2) many-input (i.e., more than two-input) AND and OR operations. We present an extensive characterization of new bulk bitwise operations in 256 off-the-shelf modern DDR4 DRAM chips. We evaluate the reliability of these operations using a metric called success rate: the fraction of correctly performed bitwise operations. Among our 19 new observations, we highlight four major results. First, we can perform the NOT operation on COTS DRAM chips with a 98.37% success rate on average. Second, we can perform up to 16-input NAND, NOR, AND, and OR operations on COTS DRAM chips with high reliability (e.g., 16-input NAND, NOR, AND, and OR with an average success rate of 94.94%, 95.87%, 94.94%, and 95.85%, respectively). Third, data pattern only slightly affects bitwise operations. Our results show that executing NAND, NOR, AND, and OR operations with random data patterns decreases the success rate compared to all logic-1/logic-0 patterns by 1.39%, 1.97%, 1.43%, and 1.98%, respectively. Fourth, bitwise operations are highly resilient to temperature changes, with small success rate fluctuations of at most 1.66% when the temperature is increased from 50C to 95C. We open-source our infrastructure at https://github.com/CMU-SAFARI/FCDRAM
△ Less
Submitted 21 April, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
PULSAR: Simultaneous Many-Row Activation for Reliable and High-Performance Computing in Off-the-Shelf DRAM Chips
Authors:
Ismail Emir Yuksel,
Yahya Can Tugrul,
F. Nisa Bostanci,
Abdullah Giray Yaglikci,
Ataberk Olgun,
Geraldo F. Oliveira,
Melina Soysal,
Haocong Luo,
Juan Gomez Luna,
Mohammad Sadrosadati,
Onur Mutlu
Abstract:
Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where bulk-bitwise operations are performed leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1)…
▽ More
Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where bulk-bitwise operations are performed leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1) modern off-the-shelf DRAM chips do not officially support PuM operations, and 2) existing techniques of performing PuM operations on off-the-shelf DRAM chips suffer from two key limitations. First, these techniques have low success rates, i.e., only a small fraction of DRAM columns can correctly execute PuM operations because they operate beyond manufacturer-recommended timing constraints, causing these operations to be highly susceptible to noise and process variation. Second, these techniques have limited compute primitives, preventing them from fully leveraging parallelism across DRAM columns and thus hindering their performance benefits.
We propose PULSAR, a new technique to enable high-success-rate and high-performance PuM operations in off-the-shelf DRAM chips. PULSAR leverages our new observation that a carefully crafted sequence of DRAM commands simultaneously activates up to 32 DRAM rows. PULSAR overcomes the limitations of existing techniques by 1) replicating the input data to improve the success rate and 2) enabling new bulk bitwise operations (e.g., many-input majority, Multi-RowInit, and Bulk-Write) to improve the performance.
Our analysis on 120 off-the-shelf DDR4 chips from two major manufacturers shows that PULSAR achieves a 24.18% higher success rate and 121% higher performance over seven arithmetic-logic operations compared to FracDRAM, a state-of-the-art off-the-shelf DRAM-based PuM technique.
△ Less
Submitted 18 March, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Authors:
Konstantinos Kanellopoulos,
Hong Chul Nam,
F. Nisa Bostanci,
Rahul Bera,
Mohammad Sadrosadati,
Rakesh Kumar,
Davide-Basilio Bartolini,
Onur Mutlu
Abstract:
Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), an…
▽ More
Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.
△ Less
Submitted 5 January, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator
Authors:
Haocong Luo,
Yahya Can Tuğrul,
F. Nisa Bostancı,
Ataberk Olgun,
A. Giray Yağlıkçı,
Onur Mutlu
Abstract:
We present Ramulator 2.0, a highly modular and extensible DRAM simulator that enables rapid and agile implementation and evaluation of design changes in the memory controller and DRAM to meet the increasing research effort in improving the performance, security, and reliability of memory systems. Ramulator 2.0 abstracts and models key components in a DRAM-based memory system and their interactions…
▽ More
We present Ramulator 2.0, a highly modular and extensible DRAM simulator that enables rapid and agile implementation and evaluation of design changes in the memory controller and DRAM to meet the increasing research effort in improving the performance, security, and reliability of memory systems. Ramulator 2.0 abstracts and models key components in a DRAM-based memory system and their interactions into shared interfaces and independent implementations. Doing so enables easy modification and extension of the modeled functions of the memory controller and DRAM in Ramulator 2.0. The DRAM specification syntax of Ramulator 2.0 is concise and human-readable, facilitating easy modifications and extensions. Ramulator 2.0 implements a library of reusable templated lambda functions to model the functionalities of DRAM commands to simplify the implementation of new DRAM standards, including DDR5, LPDDR5, HBM3, and GDDR6. We showcase Ramulator 2.0's modularity and extensibility by implementing and evaluating a wide variety of RowHammer mitigation techniques that require different memory controller design changes. These techniques are added modularly as separate implementations without changing any code in the baseline memory controller implementation. Ramulator 2.0 is rigorously validated and maintains a fast simulation speed compared to existing cycle-accurate DRAM simulators. Ramulator 2.0 is open-sourced under the permissive MIT license at https://github.com/CMU-SAFARI/ramulator2
△ Less
Submitted 28 November, 2023; v1 submitted 21 August, 2023;
originally announced August 2023.
-
TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs
Authors:
İsmail Emir Yüksel,
Ataberk Olgun,
Behzad Salami,
F. Nisa Bostancı,
Yahya Can Tuğrul,
A. Giray Yağlıkçı,
Nika Mansouri Ghiasi,
Onur Mutlu,
Oğuz Ergin
Abstract:
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays. SRAM arrays are widely used in a majority of specialized or general-purpose chips that perform the computation to store data inside the chip. Thus, SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs. However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high energy consumption,…
▽ More
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays. SRAM arrays are widely used in a majority of specialized or general-purpose chips that perform the computation to store data inside the chip. Thus, SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs. However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high energy consumption, 3) high TRNG latency, and 4) the inability to generate true random numbers continuously, which limits the application space of SRAM-based TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes these four key limitations and thus, extends the application space of SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput, energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous operation. TuRaN leverages the key observation that accessing SRAM cells results in random access failures when the supply voltage is reduced below the manufacturer-recommended supply voltage. TuRaN generates random numbers at high throughput by repeatedly accessing SRAM cells with reduced supply voltage and post-processing the resulting random faults using the SHA-256 hash function. To demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different process nodes and analyze the potential of access failure for use as an entropy source. We verify and support our simulation results by conducting real-world experiments on two commercial off-the-shelf FPGA boards. We evaluate the quality of the random numbers generated by TuRaN using the widely-adopted NIST standard randomness tests and observe that TuRaN passes all tests. TuRaN generates true random numbers with (i) an average (maximum) throughput of 1.6Gbps (1.812Gbps), (ii) 0.11nJ/bit energy consumption, and (iii) 278.46us latency.
△ Less
Submitted 20 November, 2022;
originally announced November 2022.
-
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
Authors:
Ataberk Olgun,
F. Nisa Bostanci,
Geraldo F. Oliveira,
Yahya Can Tugrul,
Rahul Bera,
A. Giray Yaglikci,
Hasan Hassan,
Oguz Ergin,
Onur Mutlu
Abstract:
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfers and DRAM row activation. Sectored DRAM leverages two key ideas to enable fine-grained data transfers and row activation at low chip area cost. First, a cache block transfer between main memory and the memory controller happens in a fixed number of clock cycles where…
▽ More
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfers and DRAM row activation. Sectored DRAM leverages two key ideas to enable fine-grained data transfers and row activation at low chip area cost. First, a cache block transfer between main memory and the memory controller happens in a fixed number of clock cycles where only a small portion of the cache block (a word) is transferred in each cycle. Sectored DRAM augments the memory controller and the DRAM chip to execute cache block transfers in a variable number of clock cycles based on the workload access pattern with minor modifications to the memory controller's and the DRAM chip's circuitry. Second, a large DRAM row, by design, is already partitioned into smaller independent physically isolated regions. Sectored DRAM provides the memory controller with the ability to activate each such region based on the workload access pattern via small modifications to the DRAM chip's array access circuitry. Activating smaller regions of a large row relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster.
Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM's DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM's DRAM chip area overhead is 1.7% the area of a modern DDR4 chip. We hope and believe that Sectored DRAM's ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
△ Less
Submitted 9 June, 2024; v1 submitted 27 July, 2022;
originally announced July 2022.
-
DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators
Authors:
F. Nisa Bostancı,
Ataberk Olgun,
Lois Orosa,
A. Giray Yağlıkçı,
Jeremie S. Kim,
Hasan Hassan,
Oğuz Ergin,
Onur Mutlu
Abstract:
Random number generation is an important task in a wide variety of critical applications including cryptographic algorithms, scientific simulations, and industrial testing tools. True Random Number Generators (TRNGs) produce truly random data by sampling a physical entropy source that typically requires custom hardware and suffers from long latency. To enable high-bandwidth and low-latency TRNGs o…
▽ More
Random number generation is an important task in a wide variety of critical applications including cryptographic algorithms, scientific simulations, and industrial testing tools. True Random Number Generators (TRNGs) produce truly random data by sampling a physical entropy source that typically requires custom hardware and suffers from long latency. To enable high-bandwidth and low-latency TRNGs on commodity devices, recent works propose TRNGs that use DRAM as an entropy source. Although prior works demonstrate promising DRAM-based TRNGs, integration of such mechanisms into real systems poses challenges. We identify three challenges for using DRAM-based TRNGs in current systems: (1) generating random numbers can degrade system performance by slowing down concurrently-running applications due to the interference between RNG and regular memory operations in the memory controller (i.e., RNG interference), (2) this RNG interference can degrade system fairness by unfairly prioritizing applications that intensively use random numbers (i.e., RNG applications), and (3) RNG applications can experience significant slowdowns due to the high RNG latency. We propose DR-STRaNGe, an end-to-end system design for DRAM-based TRNGs that (1) reduces the RNG interference by separating RNG requests from regular requests in the memory controller, (2) improves the system fairness with an RNG-aware memory request scheduler, and (3) hides the large TRNG latencies using a random number buffering mechanism with a new DRAM idleness predictor that accurately identifies idle DRAM periods. We evaluate DR-STRaNGe using a set of 186 multiprogrammed workloads. Compared to an RNG-oblivious baseline system, DR-STRaNGe improves the average performance of non-RNG and RNG applications by 17.9% and 25.1%, respectively. DR-STRaNGe improves average system fairness by 32.1% and reduces average energy consumption by 21%.
△ Less
Submitted 6 June, 2022; v1 submitted 4 January, 2022;
originally announced January 2022.