-
A Primer on RecoNIC: RDMA-enabled Compute Offloading on SmartNIC
Authors:
Guanwen Zhong,
Aditya Kolekar,
Burin Amornpaisannon,
Inho Choi,
Haris Javaid,
Mario Baldi
Abstract:
Today's data centers consist of thousands of network-connected hosts, each with CPUs and accelerators such as GPUs and FPGAs. These hosts also contain network interface cards (NICs), operating at speeds of 100Gb/s or higher, that are used to communicate with each other. We propose RecoNIC, an FPGA-based RDMA-enabled SmartNIC platform that is designed for compute acceleration while minimizing the overhead associated with data copies (in CPU-centric accelerator systems) by bringing network data as close to computation as possible. Since RDMA is the de facto transport-layer protocol for improved communication in data center workloads, RecoNIC includes an RDMA offload engine for high-throughput and low-latency data transfers. Developers have the flexibility to design their accelerators using RTL, HLS or Vitis Networking P4 within RecoNIC's programmable compute blocks. These compute blocks can access host memory as well as memory in remote peers through the RDMA offload engine. Furthermore, the RDMA offload engine is shared by both the host and compute blocks, which makes RecoNIC a very flexible platform. Lastly, we have open-sourced RecoNIC for the research community to enable experimentation with RDMA-based applications and use cases.
Submitted 11 December, 2023;
originally announced December 2023.
-
Improving Energy Efficiency of Permissioned Blockchains Using FPGAs
Authors:
Nathania Santoso,
Haris Javaid
Abstract:
Permissioned blockchains like Hyperledger Fabric have become quite popular for implementation of enterprise applications. Recent research has mainly focused on improving performance of permissioned blockchains without any consideration of their power/energy consumption. In this paper, we conduct a comprehensive empirical study to understand the energy efficiency (throughput/energy) of the validator peer in Hyperledger Fabric (a major bottleneck node). We pick a number of optimizations for the validator peer from the literature (allocated CPUs, software block cache and FPGA-based accelerator). First, we propose a methodology to measure power/energy consumption of the two resulting compute platforms (CPU-only and CPU+FPGA). Then, we use our methodology to evaluate the energy efficiency of a diverse set of validator peer configurations, and present many useful insights. With careful selection of software optimizations and FPGA accelerator configuration, we improved the energy efficiency of the validator peer by 10$\times$ compared to a vanilla validator peer (i.e., energy-aware provisioning of the validator peer can deliver 10$\times$ more throughput while consuming the same amount of energy). In absolute terms, this means 23,000 tx/s with a power consumption of 118 W from a validator peer using software block cache running on a 4-core server with an AMD/Xilinx Alveo U250 FPGA card.
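As a rough illustration of the throughput/energy metric named in this abstract, the two absolute figures it reports (23,000 tx/s and 118 W) can be combined into transactions per joule. This is a minimal sketch: the function name is hypothetical, and the calculation assumes the reported power is the average draw sustained at the reported throughput, which the abstract does not state explicitly.

```python
def transactions_per_joule(throughput_tx_per_s: float, power_w: float) -> float:
    """Energy efficiency as throughput divided by power.

    Since 1 W = 1 J/s, (tx/s) / (J/s) gives transactions per joule.
    """
    return throughput_tx_per_s / power_w


# Figures quoted in the abstract; treating 118 W as the average power
# at 23,000 tx/s is an assumption of this sketch.
efficiency = transactions_per_joule(23_000, 118)
energy_per_tx_mj = 1000 / efficiency  # millijoules per transaction

print(round(efficiency, 1))        # 194.9
print(round(energy_per_tx_mj, 2))  # 5.13
```

Under that assumption, the best configuration processes roughly 195 transactions per joule, or about 5 mJ per transaction.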
Submitted 21 October, 2022;
originally announced October 2022.
-
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference
Authors:
Jing Gong,
Hassaan Saadat,
Hasindu Gamaarachchi,
Haris Javaid,
Xiaobo Sharon Hu,
Sri Parameswaran
Abstract:
Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power required by training. Hardware approximate multipliers have shown their effectiveness for gaining resource efficiency in DNN inference accelerators; however, training with approximate multipliers is largely unexplored. To build resource-efficient accelerators with approximate multipliers supporting DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library, in order to overcome the absence of native hardware approximate multipliers in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNet and ResNet architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. Based on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, the original TensorFlow is only 8x faster than ApproxTrain.
Submitted 23 September, 2022; v1 submitted 9 September, 2022;
originally announced September 2022.
-
Efficient FPGA-based ECDSA Verification Engine for Permissioned Blockchains
Authors:
Rashmi Agrawal,
Ji Yang,
Haris Javaid
Abstract:
As enterprises embrace blockchain technology, many real-world applications have been developed and deployed using permissioned blockchain platforms (access to the network is controlled and given only to nodes with known identities). Such blockchain platforms heavily depend on cryptography to provide a layer of trust within the network, thus verification of cryptographic signatures often becomes the bottleneck. The Elliptic Curve Digital Signature Algorithm (ECDSA) is the most commonly used cryptographic scheme in permissioned blockchains. In this paper, we propose an efficient implementation of ECDSA signature verification on an FPGA, in order to improve the performance of permissioned blockchains that aim to use FPGA-based hardware accelerators.
We propose several optimizations for modular arithmetic (e.g., custom multipliers and fast modular reduction) and point arithmetic (e.g., reduced number of point double and addition operations, and optimal width NAF representation). Based on these optimized modular and point arithmetic modules, we propose an ECDSA verification engine that can be used by any application for fast verification of ECDSA signatures. We further optimize our ECDSA verification engine for Hyperledger Fabric (one of the most widely used permissioned blockchain platforms) by moving carefully selected operations to a precomputation block, thus simplifying the critical path of ECDSA signature verification. From our implementation on a Xilinx Alveo U250 accelerator board with a target frequency of 250MHz, our ECDSA verification engine can perform a single verification in $760\,\mu s$, resulting in a throughput of 1,315 verifications per second, which is ~2.5x faster than state-of-the-art FPGA-based implementations. Our Hyperledger Fabric-specific ECDSA engine can perform a single verification in $368\,\mu s$ with a throughput of 2,717 verifications per second.
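The throughput figures quoted in this abstract follow from the per-verification latencies if a single engine handles one verification at a time. The sketch below checks that arithmetic; the function name is hypothetical, and the one-at-a-time assumption is inferred from the numbers rather than stated by the authors.

```python
def verifications_per_second(latency_us: float) -> float:
    """Throughput of one engine that completes a verification every latency_us microseconds."""
    return 1_000_000 / latency_us


generic = verifications_per_second(760)  # general-purpose ECDSA engine
fabric = verifications_per_second(368)   # Hyperledger Fabric-specific engine

# Truncated to whole verifications, these match the abstract's figures.
print(int(generic))  # 1315
print(int(fabric))   # 2717

# Ratio between the two engines reported in the abstract.
print(round(760 / 368, 2))  # 2.07
```

On these numbers, moving operations to the precomputation block roughly halves the latency of a single verification.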
Submitted 3 December, 2021;
originally announced December 2021.
-
Blockchain Machine: A Network-Attached Hardware Accelerator for Hyperledger Fabric
Authors:
Haris Javaid,
Ji Yang,
Nathania Santoso,
Mohit Upadhyay,
Sundararajarao Mohan,
Chengchen Hu,
Gordon Brebner
Abstract:
In this paper, we demonstrate how Hyperledger Fabric, one of the most popular permissioned blockchains, can benefit from network-attached acceleration. The scalability and peak performance of Fabric are primarily limited by the bottlenecks present in its block validation/commit phase. We propose Blockchain Machine, a hardware accelerator coupled with a hardware-friendly communication protocol, to act as the validator peer. It can be adapted to applications and their smart contracts, and is targeted for a server with a network-attached FPGA acceleration card. The Blockchain Machine retrieves blocks and their transactions in hardware directly from the network interface, which are then validated through a configurable and efficient block-level and transaction-level pipeline. The validation results are then transferred to the host CPU where non-bottleneck operations are executed. From our implementation integrated with Fabric v1.4 LTS, we observed up to 12x speedup in block validation when compared to a software-only validator peer, with commit throughput of up to 68,900 tps. Our work provides an acceleration platform that will foster further research on hardware acceleration of permissioned blockchains.
Submitted 20 September, 2021; v1 submitted 14 April, 2021;
originally announced April 2021.
-
Charged anisotropic compact objects obeying Karmarkar condition
Authors:
Y. Gomez-Leyton,
Hina Javaid,
L. S. Rocha,
Francisco Tello-Ortiz
Abstract:
This research develops a well-established analytical solution of the Einstein-Maxwell field equations. We analyze the behavior of a spherically symmetric and static interior driven by a charged anisotropic matter distribution. The class I methodology is used to close the system of equations and a suitable relation between the anisotropy factor and the electric field is imposed. The inner geometry of this toy model is described using an ansatz for the radial metric potential corresponding to the well-known isotropic Buchdahl space-time. The main properties are explored in order to determine if the obtained model is appropriate to represent a real compact body such as a neutron or quark star. We have fixed the mass and radii using the data of the compact objects SMC X-1 and LMC X-4. It was found that the electric field and electric charge have magnitudes of the order of $\sim 10^{21}\ [V/cm]$ and $\sim 10^{20}\ [C]$, respectively. The magnitude of the electric field and electric charge depends on the dimensionless parameter $χ$. To observe these effects on the total mass, mass-radius ratio and surface gravitational red-shift, we computed numerical data for different values of $χ$.
Submitted 10 December, 2020;
originally announced December 2020.
-
Optimizing Validation Phase of Hyperledger Fabric
Authors:
Haris Javaid,
Chengchen Hu,
Gordon Brebner
Abstract:
Blockchain technologies are on the rise, and Hyperledger Fabric is one of the most popular permissioned blockchain platforms. In this paper, we re-architect the validation phase of Fabric based on our analysis from a fine-grained breakdown of the validation phase's latency. Our optimized validation phase uses a chaincode cache during validation of transactions, initiates state database reads in parallel with validation of transactions, and writes to the ledger and databases in parallel. Our experiments reveal performance improvements of 2x for CouchDB and 1.3x for LevelDB. Notably, our optimizations can be adopted in a future release of Hyperledger Fabric.
Submitted 19 July, 2019;
originally announced July 2019.