-
On the Performance of Cloud-based ARM SVE for Zero-Knowledge Proving Systems
Authors:
Dumitrel Loghin,
Shuang Liang,
Shengwei Liu,
Xiong Liu,
Pingcheng Ruan,
Zhigang Ye
Abstract:
Zero-knowledge proofs (ZKPs) are becoming a gold standard in scaling blockchains and bringing Web3 to life. At the same time, ZKPs for transactions running on the Ethereum Virtual Machine require powerful servers with hundreds of CPU cores. The current zkProver implementation from Polygon is optimized for x86-64 CPUs by vectorizing key operations, such as Merkle tree building with Poseidon hashes over the Goldilocks field, with Advanced Vector Extensions (AVX and AVX512). With these optimizations, a ZKP for a batch of transactions is generated in less than two minutes. With the advent of cloud servers with ARM CPUs, which are at least 10% cheaper than x86-64 servers, and the implementation of ARM Scalable Vector Extension (SVE), we ask whether ARM servers can take over from their x86-64 counterparts. Unfortunately, our analysis shows that current ARM CPUs are not a match for their x86-64 competitors. Graviton4 from Amazon Web Services (AWS) and Axion from Google Cloud Platform (GCP) are 1.6X and 1.4X slower compared to the latest AMD EPYC and Intel Xeon servers from AWS with AVX and AVX512, respectively, when building a Merkle tree with over four million leaves. This low performance is due to (1) smaller vector size in these ARM CPUs (128 bits versus 512 bits in AVX512) and (2) lower clock frequency. On the other hand, the ARM SVE/SVE2 Instruction Set Architecture (ISA) is at least as powerful as AVX/AVX512 but more flexible. Moreover, we estimate that increasing the vector size to 512 bits will enable higher performance in ARM CPUs compared to their x86-64 counterparts while maintaining their price advantage.
Submitted 11 June, 2025;
originally announced June 2025.
-
A Number Representation Systems Library Supporting New Representations Based on Morris Tapered Floating-point with Hidden Exponent Bit
Authors:
Stefan-Dan Ciocirlan,
Dumitrel Loghin
Abstract:
The introduction of posit reopened the debate about the utility of IEEE754 in specific domains. In this context, we propose a high-level language (Scala) library that aims to reduce the effort of designing and testing new number representation systems (NRSs). The library's efficiency is tested with three new NRSs derived from Morris Tapered Floating-Point by adding a hidden exponent bit. We call these NRSs MorrisHEB, MorrisBiasHEB, and MorrisUnaryHEB, respectively. We show that they offer a better dynamic range, better decimal accuracy for unary operations, more exact results for addition (37.61% in the case of MorrisUnaryHEB), and better average decimal accuracy for inexact results on binary operations than posit and IEEE754. Going through existing benchmarks in the literature, and favorable/unfavorable examples for IEEE754/posit, we show that these new NRSs produce similar (less than one decimal accuracy difference) or even better results than IEEE754 and posit. Given the entire spectrum of results, there are arguments for MorrisBiasHEB to be used as a replacement for IEEE754 in general computations. MorrisUnaryHEB has a more populated "golden zone" (+13.6%) and a better dynamic range (149X) than posit, making it a candidate for machine learning computations.
Submitted 15 October, 2023;
originally announced October 2023.
-
Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge
Authors:
Dumitrel Loghin,
Tien Tuan Anh Dinh,
Aung Maw,
Chen Gang,
Yong Meng Teo,
Beng Chin Ooi
Abstract:
While state-of-the-art permissioned blockchains can achieve thousands of transactions per second on commodity hardware with x86/64 architecture, their performance when running on different architectures is not clear. The goal of this work is to characterize the performance and cost of permissioned blockchains on different hardware systems, which is important as diverse application domains are adopting them. To this end, we conduct extensive cost and performance evaluation of two permissioned blockchains, namely Hyperledger Fabric and ConsenSys Quorum, on five different types of hardware covering both x86/64 and ARM architecture, as well as both cloud and edge computing. The hardware nodes include servers with Intel Xeon CPU, servers with ARM-based Amazon Graviton CPU, and edge devices with ARM-based CPU. Our results reveal a diverse profile of the two blockchains across different settings, demonstrating the impact of hardware choices on the overall performance and cost. We find that Graviton servers outperform Xeon servers in many settings, due to their powerful CPU and high memory bandwidth. Edge devices with ARM architecture, on the other hand, exhibit low performance. When comparing the cloud with the edge, we show that the cost of the latter is much smaller in the long run if manpower cost is not considered.
Submitted 13 May, 2022;
originally announced May 2022.
-
The Accuracy and Efficiency of Posit Arithmetic
Authors:
Stefan Dan Ciocirlan,
Dumitrel Loghin,
Lavanya Ramapantulu,
Nicolae Tapus,
Yong Meng Teo
Abstract:
Motivated by the increasing interest in the posit numeric format, in this paper we evaluate the accuracy and efficiency of posit arithmetic in contrast to the traditional IEEE 754 32-bit floating-point (FP32) arithmetic. We first design and implement a Posit Arithmetic Unit (PAU), called POSAR, with flexible bit-sized arithmetic suitable for applications that can trade accuracy for savings in chip area. Next, we analyze the accuracy and efficiency of POSAR with a series of benchmarks including mathematical computations, ML kernels, NAS Parallel Benchmarks (NPB), and Cifar-10 CNN. This analysis is done on our implementation of POSAR integrated into a RISC-V Rocket Chip core in comparison with the IEEE 754-based Floating Point Unit (FPU) of Rocket Chip. Our analysis shows that POSAR can outperform the FPU, but the results are not spectacular. For NPB, 32-bit posit achieves better accuracy than FP32 and improves the execution by up to 2%. However, POSAR with 32-bit posit needs 30% more FPGA resources compared to the FPU. For classic ML algorithms, we find that 8-bit posits are not suitable to replace FP32 because they exhibit low accuracy leading to wrong results. Instead, 16-bit posit offers the best option in terms of accuracy and efficiency. For example, 16-bit posit achieves the same Top-1 accuracy as FP32 on a Cifar-10 CNN with a speedup of 18%.
Submitted 16 September, 2021;
originally announced September 2021.
-
Understanding the Scalability of Hyperledger Fabric
Authors:
Minh Quang Nguyen,
Dumitrel Loghin,
Tien Tuan Anh Dinh
Abstract:
The rapid growth of blockchain systems leads to increasing interest in understanding and comparing blockchain performance at scale. In this paper, we focus on analyzing the performance of Hyperledger Fabric v1.1 - one of the most popular permissioned blockchain systems. Prior works have analyzed Hyperledger Fabric v0.6 in depth, but newer versions of the system undergo significant changes that warrant new analysis. Existing works on benchmarking the system are limited in their scope: some consider only small networks, others consider scalability of only parts of the system instead of the whole. We perform a comprehensive performance analysis of Hyperledger Fabric v1.1 at scale. We extend an existing benchmarking tool to conduct experiments over many servers while scaling all important components of the system. Our results demonstrate that Fabric v1.1's scalability bottlenecks lie in the communication overhead between the execution and ordering phase. Furthermore, we show that scaling the Kafka cluster that is used for the ordering phase does not affect the overall throughput.
Submitted 21 July, 2021;
originally announced July 2021.
-
Communication-efficient Decentralized Machine Learning over Heterogeneous Networks
Authors:
Pan Zhou,
Qian Lin,
Dumitrel Loghin,
Beng Chin Ooi,
Yuncheng Wu,
Hongfang Yu
Abstract:
In the last few years, distributed machine learning has usually been executed over heterogeneous networks such as a local area network within a multi-tenant cluster or a wide area network connecting data centers and edge clusters. In these heterogeneous networks, the link speeds among worker nodes vary significantly, making it challenging for state-of-the-art machine learning approaches to perform efficient training. Both centralized and decentralized training approaches suffer from low-speed links. In this paper, we propose a decentralized approach, namely NetMax, that enables worker nodes to communicate via high-speed links and, thus, significantly speed up the training process. NetMax possesses the following novel features. First, it consists of a novel consensus algorithm that allows worker nodes to train model copies on their local dataset asynchronously and exchange information via peer-to-peer communication to synchronize their local copies, instead of relying on a central master node (i.e., parameter server). Second, each worker node selects one peer randomly with a fine-tuned probability to exchange information per iteration. In particular, peers with high-speed links are selected with high probability. Third, the probabilities of selecting peers are designed to minimize the total convergence time. Moreover, we mathematically prove the convergence of NetMax. We evaluate NetMax on heterogeneous cluster networks and show that it achieves speedups of 3.7X, 3.4X, and 1.9X in comparison with the state-of-the-art decentralized training approaches Prague, Allreduce-SGD, and AD-PSGD, respectively.
Submitted 20 October, 2020; v1 submitted 12 September, 2020;
originally announced September 2020.
-
A Transactional Perspective on Execute-order-validate Blockchains
Authors:
Pingcheng Ruan,
Dumitrel Loghin,
Quang-Trung Ta,
Meihui Zhang,
Gang Chen,
Beng Chin Ooi
Abstract:
Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering for emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the blockchain's throughput. However, this new architecture might render many transactions invalid when serializing them. This problem is further exacerbated as the block formation rate is inherently limited due to other factors besides data processing, such as cryptography and consensus.
In this work, we propose a novel method to enhance the execute-order-validate architecture, by reducing invalid transactions to improve the throughput of blockchains. Our method is inspired by state-of-the-art optimistic concurrency control techniques in modern database systems. In contrast to existing blockchains that adopt databases' preventive approaches, which might abort serializable transactions, our method is theoretically more fine-grained. Specifically, unserializable transactions are aborted before ordering and the remaining transactions are guaranteed to be serializable. For evaluation, we implement our method in two blockchains: FabricSharp on top of Hyperledger Fabric, and FastFabricSharp on top of FastFabric. We compare the performance of FabricSharp with vanilla Fabric and three related systems, two of which are implemented with one standard and one state-of-the-art concurrency control technique from databases, respectively. The results demonstrate that FabricSharp achieves 25% higher throughput compared to the other systems in nearly all experimental scenarios. Moreover, FastFabricSharp's improvement over FastFabric is up to 66%.
Submitted 22 March, 2020;
originally announced March 2020.
-
Diffusion in arrays of obstacles: beyond homogenisation
Authors:
Yahya Farah,
Daniel Loghin,
Alexandra Tzella,
Jacques Vanneste
Abstract:
We revisit the classical problem of diffusion of a scalar (or heat) released in a two-dimensional medium with an embedded periodic array of impermeable obstacles such as perforations. Homogenisation theory provides a coarse-grained description of the scalar at large times and predicts that it diffuses with a certain effective diffusivity, so the concentration is approximately Gaussian. We improve on this by developing a large-deviation approximation which also captures the non-Gaussian tails of the concentration through a rate function obtained by solving a family of eigenvalue problems. We focus on cylindrical obstacles and on the dense limit, when the obstacles occupy a large area fraction and non-Gaussianity is most marked. We derive an asymptotic approximation for the rate function in this limit, valid uniformly over a wide range of distances. We use finite-element implementations to solve the eigenvalue problems yielding the rate function for arbitrary obstacle area fractions and an elliptic boundary-value problem arising in the asymptotics calculation. Comparison between numerical results and asymptotic predictions confirms the validity of the latter.
Submitted 16 November, 2020; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Blockchains vs. Distributed Databases: Dichotomy and Fusion
Authors:
Pingcheng Ruan,
Tien Tuan Anh Dinh,
Dumitrel Loghin,
Meihui Zhang,
Gang Chen,
Qian Lin,
Beng Chin Ooi
Abstract:
Blockchain has come a long way: a system that was initially proposed specifically for cryptocurrencies is now being adapted and adopted as a general-purpose transactional system. As blockchain evolves into another data management system, the natural question is how it compares against distributed database systems. Existing works on this comparison focus on high-level properties, such as security and throughput. They stop short of showing how the underlying design choices contribute to the overall differences. Our work fills this important gap and provides a principled framework for analyzing the emerging trend of blockchain-database fusion.
We perform a twin study of blockchains and distributed database systems as two types of transactional systems. We propose a taxonomy that illustrates the dichotomy across four dimensions, namely replication, concurrency, storage, and sharding. Within each dimension, we discuss how the design choices are driven by two goals: security for blockchains, and performance for distributed databases. To expose the impact of different design choices on the overall performance, we conduct an in-depth performance analysis of two blockchains, namely Quorum and Hyperledger Fabric, and two distributed databases, namely TiDB, and etcd. Lastly, we propose a framework for back-of-the-envelope performance forecast of blockchain-database hybrids.
Submitted 15 January, 2021; v1 submitted 3 October, 2019;
originally announced October 2019.
-
5G: Agent for Further Digital Disruptive Transformations
Authors:
Beng Chin Ooi,
Gang Chen,
Dumitrel Loghin,
Wei Wang,
Meihui Zhang
Abstract:
The fifth-generation (5G) mobile communication technologies are on the way to be adopted as the next standard for mobile networking. It is therefore timely to analyze the impact of 5G on the landscape of computing, in particular, data management and data-driven technologies. With a predicted increase of 10-100$\times$ in bandwidth and 5-10$\times$ decrease in latency, 5G is expected to be the main enabler for edge computing which includes accessing cloud-like services, as well as conducting machine learning at the edge. In this paper, we examine the impact of 5G on both traditional and emerging technologies, and discuss research challenges and opportunities.
Submitted 23 September, 2019;
originally announced September 2019.
-
The Disruptions of 5G on Data-driven Technologies and Applications
Authors:
Dumitrel Loghin,
Shaofeng Cai,
Gang Chen,
Tien Tuan Anh Dinh,
Feiyi Fan,
Qian Lin,
Janice Ng,
Beng Chin Ooi,
Xutao Sun,
Quang-Trung Ta,
Wei Wang,
Xiaokui Xiao,
Yang Yang,
Meihui Zhang,
Zhonghua Zhang
Abstract:
With 5G on the verge of being adopted as the next mobile network, there is a need to analyze its impact on the landscape of computing and data management. In this paper, we analyze the impact of 5G on both traditional and emerging technologies and project our view on future research challenges and opportunities. With a predicted increase of 10-100x in bandwidth and 5-10x decrease in latency, 5G is expected to be the main enabler for smart cities, smart IoT and efficient healthcare, where machine learning is conducted at the edge. In this context, we investigate how 5G can help the development of federated learning. Network slicing, another key feature of 5G, allows running multiple isolated networks on the same physical infrastructure. However, security remains the main concern in the context of virtualization, multi-tenancy and high device density. Formal verification of 5G networks can be applied to detect security issues in massive virtualized environments. In summary, 5G will make the world even more densely and closely connected. What we have experienced in 4G connectivity will pale in comparison to the vast amounts of possibilities engendered by 5G.
Submitted 15 December, 2019; v1 submitted 6 September, 2019;
originally announced September 2019.
-
Blockchain Goes Green? An Analysis of Blockchain on Low-Power Nodes
Authors:
Dumitrel Loghin,
Gang Chen,
Tien Tuan Anh Dinh,
Beng Chin Ooi,
Yong Meng Teo
Abstract:
Motivated by the massive energy usage of blockchain, on the one hand, and by significant performance improvements in low-power, wimpy systems, on the other hand, we perform an in-depth time-energy analysis of blockchain systems on low-power nodes in comparison to high-performance nodes. We use three low-power systems to represent a wide range of the performance-power spectrum, while covering both x86/64 and ARM architectures. We show that low-end wimpy nodes are struggling to run full-fledged blockchains mainly due to their small and low-bandwidth memory. On the other hand, wimpy systems with balanced performance-to-power ratio achieve reasonable performance while saving significant amounts of energy. For example, Jetson TX2 nodes achieve around 80% and 30% of the throughput of Parity and Hyperledger, respectively, while using 18x and 23x less energy compared to traditional brawny servers with Intel Xeon CPU.
Submitted 17 June, 2019; v1 submitted 16 May, 2019;
originally announced May 2019.
-
Towards Scaling Blockchain Systems via Sharding
Authors:
Hung Dang,
Tien Tuan Anh Dinh,
Dumitrel Loghin,
Ee-Chien Chang,
Qian Lin,
Beng Chin Ooi
Abstract:
Existing blockchain systems scale poorly because of their distributed consensus protocols. Current attempts at improving blockchain scalability are limited to cryptocurrency. Scaling blockchain systems under general workloads (i.e., non-cryptocurrency applications) remains an open question. In this work, we take a principled approach to apply sharding, which is a well-studied and proven technique to scale out databases, to blockchain systems in order to improve their transaction throughput at scale. This is challenging, however, due to the fundamental difference in failure models between databases and blockchain. To achieve our goal, we first enhance the performance of Byzantine consensus protocols, thereby improving individual shards' throughput. Second, we design an efficient shard formation protocol that leverages a trusted random beacon to securely assign nodes into shards. We rely on trusted hardware, namely Intel SGX, to achieve high performance for both the consensus and shard formation protocols. Third, we design a general distributed transaction protocol that ensures safety and liveness even when transaction coordinators are malicious. Finally, we conduct an extensive evaluation of our design both on a local cluster and on Google Cloud Platform. The results show that our consensus and shard formation protocols outperform state-of-the-art solutions at scale. More importantly, our sharded blockchain reaches a high throughput that can handle Visa-level workloads, and is the largest ever reported in a realistic environment.
Submitted 12 March, 2019; v1 submitted 2 April, 2018;
originally announced April 2018.