SMART: A Heterogeneous Scratchpad Memory Architecture for Superconductor SFQ-based Systolic CNN Accelerators
Authors:
Farzaneh Zokaee,
Lei Jiang
Abstract:
Ultra-fast \& low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent a SFQ CNN accelerator from exceeding 40\% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic m…
▽ More
Ultra-fast \& low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent a SFQ CNN accelerator from exceeding 40\% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic memory technologies, including Vortex Transition Memory (VTM), Josephson-CMOS SRAM, MRAM, and Superconducting Nanowire Memory, during which we found that none of the aforementioned technologies made a SFQ CNN accelerator achieve high throughput, small area, and low power simultaneously. Second, we present a heterogeneous SPM architecture, SMART, composed of SHIFT arrays and a random access array to improve the inference throughput of a SFQ CNN systolic accelerator. Third, we propose a fast, low-power and dense pipelined random access CMOS-SFQ array by building SFQ passive-transmission-line-based H-Trees that connect CMOS sub-banks. Finally, we create an ILP-based compiler to deploy CNN models on SMART. Experimental results show that, with the same chip area overhead, compared to the latest SHIFT-based SFQ CNN accelerator, SMART improves the inference throughput by $3.9\times$ ($2.2\times$), and reduces the inference energy by $86\%$ ($71\%$) when inferring a single image (a batch of images).
△ Less
Submitted 2 September, 2021;
originally announced September 2021.
EXMA: A Genomics Accelerator for Exact-Matching
Authors:
Lei Jiang,
Farzaneh Zokaee
Abstract:
Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts Ferragina-Manzini Index (FM-Index) augmenting space-efficient Burrows-Wheeler transform (BWT) with additional data structures t…
▽ More
Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts Ferragina-Manzini Index (FM-Index) augmenting space-efficient Burrows-Wheeler transform (BWT) with additional data structures to permit ultra-fast exact-match operations. However, FM-Index is notorious for its poor spatial locality and random memory access pattern. Prior works create GPU-, FPGA-, ASIC- and even process-in-memory (PIM)-based accelerators to boost FM-Index search throughput. Though they achieve the state-of-the-art FM-Index search throughput, the same as all prior conventional accelerators, FM-Index PIMs process only one DNA symbol after each DRAM row activation, thereby suffering from poor memory bandwidth utilization.
In this paper, we propose a hardware accelerator, EXMA, to enhance FM-Index search throughput. We first create a novel EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation. We then build an accelerator to search over an EXMA table. We propose 2-stage scheduling to increase the cache hit rate of our accelerator. We introduce dynamic page policy to improve the row buffer hit rate of DRAM main memory. We also present CHAIN compression to reduce the data structure size of EXMA tables. Compared to state-of-the-art FM-Index PIMs, EXMA improves search throughput by $4.9\times$, and enhances search throughput per Watt by $4.8\times$.
△ Less
Submitted 13 January, 2021;
originally announced January 2021.