-
Efficient Real-Time Selective Genome Sequencing on Resource-Constrained Devices
Authors:
Po Jui Shih,
Hassaan Saadat,
Sri Parameswaran,
Hasindu Gamaarachchi
Abstract:
Third-generation nanopore sequencers offer a feature called selective sequencing, or 'Read Until', that allows genomic reads to be analyzed in real time and abandoned partway through if they do not belong to a genomic region of 'interest'. This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. For selective sequencing to be effective, analysis latency should be as low as possible so that unnecessary reads can be rejected as early as possible. However, existing methods that employ the subsequence Dynamic Time Warping (sDTW) algorithm for this problem are so computationally intensive that a massive workstation with dozens of CPU cores still struggles to keep up with the data rate of a mobile phone-sized MinION sequencer. In this paper, we present Hardware Accelerated Read Until (HARU), a resource-efficient method based on hardware-software co-design that exploits a low-cost and portable heterogeneous MPSoC platform with an on-chip FPGA to accelerate the sDTW-based Read Until algorithm. Experimental results show that, for a SARS-CoV-2 dataset, HARU on a Xilinx FPGA embedded with a 4-core ARM processor is around 2.5X faster than a highly optimized multi-threaded software version (around 85X faster than the existing unoptimized multi-threaded software) running on a sophisticated server with a 36-core Intel Xeon processor. The energy consumption of HARU is two orders of magnitude lower than that of the same application executing on the 36-core server. The source code for the HARU sDTW module is available as open source at https://github.com/beebdev/HARU, and an example application that utilises HARU is at https://github.com/beebdev/sigfish-haru.
Submitted 14 November, 2022;
originally announced November 2022.
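The sDTW-based Read Until decision described in the abstract above can be illustrated with a minimal software sketch. It is not the HARU implementation: it assumes plain floating-point signal values, an absolute-difference distance, and a hypothetical rejection threshold (`should_keep` and `threshold` are illustrative names, not taken from the source). It shows only the core idea of subsequence DTW, in which the query may start and end anywhere in the reference.

```python
def sdtw(query, reference):
    """Subsequence DTW: align the whole query to any contiguous part of reference.

    Returns (best_cost, end_index), where end_index is the reference position
    at which the best alignment ends.
    """
    if not query or not reference:
        raise ValueError("query and reference must be non-empty")
    m = len(reference)
    # Free start: the first query sample may match any reference position.
    prev = [abs(query[0] - r) for r in reference]
    for i in range(1, len(query)):
        curr = [0.0] * m
        curr[0] = prev[0] + abs(query[i] - reference[0])
        for j in range(1, m):
            d = abs(query[i] - reference[j])
            # Standard DTW recurrence over the three neighbouring cells.
            curr[j] = d + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    # Free end: take the cheapest cell in the last row.
    end = min(range(m), key=prev.__getitem__)
    return prev[end], end


def should_keep(read_prefix, reference, threshold):
    """Hypothetical Read Until decision: keep the read if it aligns cheaply."""
    cost, _ = sdtw(read_prefix, reference)
    return cost <= threshold
```

For example, `sdtw([1, 2, 3], [9, 9, 1, 2, 3, 9, 9])` finds an exact match ending at reference index 4. Only two rows of the cost matrix are kept at a time, so memory grows with the reference length rather than with the product of the two lengths.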
-
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference
Authors:
Jing Gong,
Hassaan Saadat,
Hasindu Gamaarachchi,
Haris Javaid,
Xiaobo Sharon Hu,
Sri Parameswaran
Abstract:
Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power required by training. Hardware approximate multipliers have shown their effectiveness for gaining resource efficiency in DNN inference accelerators; however, training with approximate multipliers is largely unexplored. To build resource-efficient accelerators with approximate multipliers that support DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library in order to overcome the absence of native hardware approximate multipliers in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNet and ResNet architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. The original TensorFlow, which is based on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, is only 8x faster than ApproxTrain.
Submitted 23 September, 2022; v1 submitted 9 September, 2022;
originally announced September 2022.
-
Computer Architecture-Aware Optimisation of DNA Analysis Systems
Authors:
Hasindu Gamaarachchi
Abstract:
DNA sequencing is revolutionising the field of medicine. DNA sequencers, the machines which perform DNA sequencing, have evolved from the size of a fridge to that of a mobile phone over the last two decades. The cost of sequencing a human genome has also fallen from billions of dollars to hundreds of dollars. Despite these improvements, DNA sequencers output hundreds or thousands of gigabytes of data that must be analysed on computers to discover meaningful information with biological implications. Unfortunately, the analysis techniques have not kept pace with rapidly improving sequencing technologies. Consequently, even today, DNA analysis is performed on high-performance computers, just as it was a couple of decades ago. Such high-performance computers are not portable, so the full utility of an ultra-portable sequencer for sequencing in the field or at the point of care is limited by the lack of portable, lightweight analytic techniques. This thesis proposes computer architecture-aware optimisation of DNA analysis software. DNA analysis software is inevitably convoluted due to the complexity associated with biological data. Modern computer architectures are also complex. Performing architecture-aware optimisations requires the synergistic use of knowledge from both domains (i.e., DNA sequence analysis and computer architecture). This thesis aims to draw the two domains together. In this thesis, gold-standard DNA sequence analysis workflows are systematically examined for algorithmic components that cause performance bottlenecks. Identified bottlenecks are resolved through architecture-aware optimisations at different levels, i.e., memory, cache, register and processor. The optimised software tools are used in complete end-to-end analysis workflows, and their efficacy is demonstrated by running them on prototypical embedded systems.
Submitted 13 January, 2021;
originally announced January 2021.
-
Power Analysis Based Side Channel Attack
Authors:
Hasindu Gamaarachchi,
Harsha Ganegoda
Abstract:
Power analysis is a branch of side channel attacks in which power consumption data is used as the side channel to attack the system. First, power traces are collected with a device such as an oscilloscope while the cryptographic device performs the cryptographic operation. Then those traces are statistically analysed using methods such as Correlation Power Analysis (CPA) to derive the secret key of the system. Because they can break the Advanced Encryption Standard (AES) in a few minutes, power analysis attacks have become a serious security issue for cryptographic devices such as smart cards.
As the first phase of our project, we build a testbed for doing research on power analysis attacks. Because power analysis is a practical type of attack, a testbed is the first requirement for any research. Since building a testbed is a complicated process, having a pre-built testbed would save future researchers time. The second phase of our project is to attack the latest cryptographic algorithm, Speck, which has been released by the National Security Agency (NSA) for use in embedded systems. Although Speck differs from AES in many ways, which makes it impossible to directly use the power analysis approach used for AES, we introduce novel approaches to break Speck in less than an hour. In the third phase of the project, we select a few previously introduced countermeasures and practically attack them on our testbed to do a comparative analysis. We show that software countermeasures such as random instruction injection and randomly shuffling S-boxes are good enough given their simplicity and cost. However, we identify a possible threat arising from the problem of generating a good seed for the pseudo-random algorithm running on the microcontroller. We attempt to address this issue by using a hardware-based true random generator that amplifies a random electrical signal and samples it to generate a proper seed.
Submitted 3 January, 2018;
originally announced January 2018.
-
Accelerating Correlation Power Analysis Using Graphics Processing Units
Authors:
Hasindu Gamaarachchi,
Roshan Ragel,
Darshana Jayasinghe
Abstract:
Correlation Power Analysis (CPA) is a type of power analysis based side channel attack that can be used to derive the secret key of encryption algorithms including DES (Data Encryption Standard) and AES (Advanced Encryption Standard). A typical CPA attack on unprotected AES is performed by analysing a few thousand power traces, which requires about an hour of computational time on a general-purpose CPU. Due to the severity of this situation, a large number of researchers work on countermeasures to such attacks. Verifying that a proposed countermeasure works well requires performing the CPA attack on about 1.5 million power traces. On commodity hardware, such processing would run for several days even for a single verification attempt, making the verification process infeasible. Modern Graphics Processing Units (GPUs) support thousands of lightweight threads, making them ideal for parallelizable algorithms like CPA. Although a GPU costs less than a high-performance multicore server, its performance for this algorithm is many times better than that of a multicore server. We present an algorithm and its GPU implementation for CPA on 128-bit AES that executes 1300x faster than on a single-threaded CPU and more than 60x faster than on a 32-threaded multicore server. We show that an attack that would take hours on the multicore server would take less than a minute on a much more cost-effective GPU.
Submitted 24 December, 2014;
originally announced December 2014.
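The statistical core of the CPA attack described in the abstract above can be sketched on a toy problem. This is not the authors' GPU implementation and it does not target AES. It assumes simulated traces, a Hamming-weight leakage model, a single 4-bit key nibble, and a small 4-bit substitution table chosen purely for illustration (`SBOX4`, `simulate_traces` and `cpa_recover_key` are hypothetical names). The key guess whose predicted leakage correlates best with the traces is taken as the recovered key.

```python
import random

# Illustrative 4-bit substitution table (an assumption; NOT the AES S-box).
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x):
    """Number of set bits in x."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def simulate_traces(key, n_traces, noise, rng):
    """Simulate one leakage sample per trace: HW(SBOX4[p ^ key]) plus Gaussian noise."""
    plaintexts = [rng.randrange(16) for _ in range(n_traces)]
    traces = [hamming_weight(SBOX4[p ^ key]) + rng.gauss(0.0, noise)
              for p in plaintexts]
    return plaintexts, traces


def cpa_recover_key(plaintexts, traces):
    """Return the key guess whose leakage model best correlates with the traces."""
    def score(guess):
        model = [hamming_weight(SBOX4[p ^ guess]) for p in plaintexts]
        return pearson(model, traces)
    return max(range(16), key=score)
```

Each key guess is scored independently of the others. That independence is the kind of parallelism the abstract says suits GPUs, although a real attack would also correlate against many sample points per trace and repeat the search for every key byte.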