-
An Asynchronous and Low-Power True Random Number Generator using STT-MTJ
Authors:
Ben Perach,
Shahar Kvatinsky
Abstract:
The emerging Spin Transfer Torque Magnetic Tunnel Junction (STT-MTJ) technology exhibits interesting stochastic behavior combined with small area and low operation energy. It is, therefore, a promising technology for security applications, specifically the generation of random numbers. In this paper, STT-MTJ is used to construct an asynchronous true random number generator (TRNG) with low power and a high entropy rate. The asynchronous design enables decoupling of the random number generation from the system clock, allowing it to be embedded in low-power devices. The proposed TRNG is evaluated by a numerical simulation, using the Landau-Lifshitz-Gilbert (LLG) equation as the model of the STT-MTJ devices. Design considerations, attack analysis, and process variation are discussed and evaluated. We show that our design is robust to process variation, achieving an entropy generating rate between 99.7 Mbps and 127.8 Mbps with 6-7.7 pJ per bit for 90% of the instances.
Submitted 26 July, 2023;
originally announced July 2023.
-
Accelerating Relational Database Analytical Processing with Bulk-Bitwise Processing-in-Memory
Authors:
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Online Analytical Processing (OLAP) for relational databases is a business decision support application. The application receives queries about the business database, usually requesting to summarize many database records, and produces few results. Existing OLAP requires transferring a large amount of data between the memory and the CPU, performing only a few operations per datum, and producing a small output. Hence, OLAP is a good candidate for processing-in-memory (PIM), where computation is performed where the data is stored, thus accelerating applications by reducing data movement between the memory and CPU. In particular, bulk-bitwise PIM, where the memory array is a bit-vector processing unit, seems a good match for OLAP. With the extensive inherent parallelism and minimal data movement of bulk-bitwise PIM, OLAP applications can process the entire database in parallel in memory, transferring only the results to the CPU. This paper shows a full-stack adaptation of bulk-bitwise PIM, from compiling SQL to hardware implementation, for supporting OLAP applications. Evaluated on the Star Schema Benchmark (SSB), bulk-bitwise PIM achieves a 4.65X speedup over MonetDB, a standard database system.
Submitted 2 July, 2023;
originally announced July 2023.
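The abstract above describes the memory array as a bit-vector processing unit. As a rough illustration of that idea only, the sketch below evaluates an equality filter over a bit-sliced column using nothing but bitwise AND and NOT. It is not the paper's implementation: Python integers stand in for memory rows, and the names `bit_slice`, `filter_equal`, and `selected_rows` are hypothetical.

```python
def bit_slice(values, width):
    """Transpose a column of unsigned ints into `width` bit-vectors.

    Slice i holds bit i of every record; record r sits at bit position r.
    (Assumed layout for illustration, not taken from the paper.)
    """
    slices = [0] * width
    for row, value in enumerate(values):
        for bit in range(width):
            if (value >> bit) & 1:
                slices[bit] |= 1 << row
    return slices


def filter_equal(slices, constant, num_rows):
    """Return a bit-vector with bit r set where record r equals `constant`."""
    all_rows = (1 << num_rows) - 1
    match = all_rows
    for bit, column_slice in enumerate(slices):
        if (constant >> bit) & 1:
            # Records must have this bit set.
            match &= column_slice
        else:
            # Records must have this bit clear.
            match &= ~column_slice & all_rows
    return match


def selected_rows(mask, num_rows):
    """List the row indices whose bit is set in `mask`."""
    return [row for row in range(num_rows) if (mask >> row) & 1]


column = [3, 7, 3, 1, 3]
mask = filter_equal(bit_slice(column, width=3), constant=3, num_rows=len(column))
print(selected_rows(mask, len(column)))
```

Each loop iteration in `filter_equal` touches every record at once through a single bitwise operation, which is the property that lets only the final result mask travel to the CPU.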
-
ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory
Authors:
Marcel Khalifa,
Barak Hoffer,
Orian Leitersdorf,
Robert Hanhan,
Ben Perach,
Leonid Yavits,
Shahar Kvatinsky
Abstract:
DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast, high-quality sequence classifiers are highly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.
Submitted 5 November, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Enabling Relational Database Analytical Processing in Bulk-Bitwise Processing-In-Memory
Authors:
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Bulk-bitwise processing-in-memory (PIM), an emerging computational paradigm utilizing memory arrays as computational units, has been shown to benefit database applications. This paper demonstrates how GROUP-BY and JOIN, database operations not supported by previous works, can be performed efficiently in bulk-bitwise PIM for relational database analytical processing. We extend the gem5 simulator and evaluate our hardware modifications on the Star Schema Benchmark. We show that, compared to previous works, our modifications improve (on average) execution time by 1.83X, energy by 4.31X, and the system's lifetime by 3.21X. We also achieve a speedup of 4.65X over MonetDB, a modern state-of-the-art in-memory database.
Submitted 2 November, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
On Consistency for Bulk-Bitwise Processing-in-Memory
Authors:
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Processing-in-memory (PIM) architectures allow software to explicitly initiate computation in the memory. This effectively makes PIM operations a new class of memory operations, alongside standard memory operations (e.g., load, store). For software correctness, it is crucial to have ordering rules for a PIM operation with respect to other PIM operations and other memory operations, i.e., a consistency model that takes PIM operations into account is vital. To the best of our knowledge, existing works have given little attention to PIM operation consistency. In this paper, we focus on a specific PIM approach, named bulk-bitwise PIM. In bulk-bitwise PIM, large bitwise operations are performed directly in the memory array, and their results are stored there. We show that previous solutions for the related topic of maintaining coherency of bulk-bitwise PIM have broken the host's native consistency model and prevent any guaranteed correctness. As a solution, we propose and evaluate four consistency models for bulk-bitwise PIM, from strict to relaxed. Our designs also preserve coherency between PIM and the host processor. Evaluating the proposed designs' performance with a gem5 simulation, using the YCSB short-range scan benchmark and TPC-H queries, shows that the run time overhead of guaranteeing correctness is at most 6%, and in many cases the run time is even improved. The hardware overhead of our design is less than 0.22%.
Submitted 7 December, 2022; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Understanding Bulk-Bitwise Processing In-Memory Through Database Analytics
Authors:
Ben Perach,
Ronny Ronen,
Benny Kimelfeld,
Shahar Kvatinsky
Abstract:
Bulk-bitwise processing-in-memory (PIM), where large bitwise operations are performed in parallel by the memory array itself, is an emerging form of computation with the potential to mitigate the memory wall problem. This paper examines the capabilities of bulk-bitwise PIM by constructing PIMDB, a fully-digital system based on memristive stateful logic, utilizing and focusing on in-memory bulk-bitwise operations, designed to accelerate a real-life workload: analytical processing of relational databases. We introduce a host processor programming model to support bulk-bitwise PIM in virtual memory, develop techniques to efficiently perform in-memory filtering and aggregation operations, and adapt the application data set into the memory. To understand bulk-bitwise PIM, we compare it to an equivalent in-memory database on the same host system. We show that bulk-bitwise PIM substantially lowers the number of required memory read operations, thus accelerating TPC-H filter operations by 1.6× to 18× and full queries by 56× to 608×, while reducing the energy consumption by 1.7× to 18.6× and 0.81× to 12× for these benchmarks, respectively. Our extensive evaluation uses the gem5 full-system simulation environment. The simulations also evaluate cell endurance, showing that the required endurance is within the range of existing endurance of RRAM devices.
Submitted 26 September, 2023; v1 submitted 20 March, 2022;
originally announced March 2022.
-
The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems
Authors:
Ronny Ronen,
Adi Eliahu,
Orian Leitersdorf,
Natan Peled,
Kunal Korgaonkar,
Anupam Chattopadhyay,
Ben Perach,
Shahar Kvatinsky
Abstract:
Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems, IMAGING and FloatPIM. Based on these demonstrations, we draw insights about PIM and its combination with a CPU.
Submitted 21 July, 2021;
originally announced July 2021.
-
Efficient Error-Correcting-Code Mechanism for High-Throughput Memristive Processing-in-Memory
Authors:
Orian Leitersdorf,
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Inefficient data transfer between computation and memory inspired emerging processing-in-memory (PIM) technologies. Many PIM solutions enable storage and processing using memristors in a crossbar-array structure, with techniques such as memristor-aided logic (MAGIC) used for computation. This approach provides highly parallel logic computation with minimal data movement. However, memristors are vulnerable to soft errors, and standard error-correcting-code (ECC) techniques are difficult to implement without moving data outside the memory. We propose a novel technique for efficient ECC implementation along diagonals to support reliable computation inside the memory without explicitly reading the data. Our evaluation demonstrates an improvement of over eight orders of magnitude in reliability (mean time to failure) for an increase of about 26% in computation latency.
Submitted 10 May, 2021;
originally announced May 2021.
-
Training of Quantized Deep Neural Networks using a Magnetic Tunnel Junction-Based Synapse
Authors:
Tzofnat Greenberg Toledo,
Ben Perach,
Itay Hubara,
Daniel Soudry,
Shahar Kvatinsky
Abstract:
Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and training with quantized weight and activation values, without sacrificing accuracy. A recent example is the GXNOR framework for stochastic training of ternary (TNN) and binary (BNN) neural networks. In this paper, we show how magnetic tunnel junction (MTJ) devices can be used to support QNN training. We introduce a novel hardware synapse circuit that uses the MTJ stochastic behavior to support the quantized update. The proposed circuit enables processing near memory (PNM) of QNN training, which subsequently reduces data movement. We simulated MTJ-based stochastic training of a TNN over the MNIST, SVHN, and CIFAR10 datasets and achieved an accuracy of 98.61%, 93.99%, and 82.71%, respectively (less than 1% degradation compared to the GXNOR algorithm). We evaluated the synapse array performance potential and showed that the proposed synapse circuit can train ternary networks in situ, with 18.3 TOPS/W for feedforward and 3 TOPS/W for weight update.
Submitted 29 May, 2022; v1 submitted 29 December, 2019;
originally announced December 2019.
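The abstract above mentions a stochastic, quantized weight update for ternary networks. As a loose software analogy only, the sketch below applies generic stochastic rounding to a ternary weight: the real-valued target is rounded up or down at random so that its expected value is preserved. This is an assumption-laden illustration, not the GXNOR algorithm and not the paper's MTJ circuit; the function name `stochastic_ternary_update` and its parameters are hypothetical, and a pseudo-random generator stands in for the device's physical randomness.

```python
import math
import random


def stochastic_ternary_update(weight, delta, rng=random):
    """Apply a real-valued update `delta` to a weight in {-1, 0, 1}.

    The continuous target is clipped to [-1, 1] and then rounded up with
    probability equal to its fractional part, so the expected value of the
    result equals the clipped target.
    """
    target = max(-1.0, min(1.0, weight + delta))
    lower = math.floor(target)
    fraction = target - lower
    # rng.random() lies in [0, 1), so a zero fraction never rounds up.
    rounded = lower + (1 if rng.random() < fraction else 0)
    return int(rounded)


rng = random.Random(0)
samples = [stochastic_ternary_update(0, 0.25, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 0.25 on average
```

Small updates that deterministic rounding would always discard still change the weight some of the time here, which is why randomness is useful when weights are restricted to a few levels.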