-
How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis
Authors:
Bekir Turkkan,
Elvis Rodrigues,
Tevfik Kosar,
Aleksey Charapko,
Ailidani Ailijiang,
Murat Demirbas
Abstract:
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of a standard benchmarking tool for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tol…
▽ More
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of a standard benchmarking tool for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare known and widely used distributed coordination services, their evaluations, and the tools used to benchmark those systems. We identify important requirements of distributed coordination service benchmarking, like the metrics and parameters that need to be evaluated and their evaluation setups and tools.
△ Less
Submitted 22 January, 2025; v1 submitted 14 March, 2024;
originally announced March 2024.
-
BunchBFT: Across-Cluster Consensus Protocol
Authors:
Salem Alqahtani,
Murat Demirbas
Abstract:
In this paper, we present BunchBFT Byzantine fault-tolerant state-machine replication for high performance and scalability. At the heart of BunchBFT is a novel design called the cluster-based approach that divides the replicas into clusters of replicas. By combining this cluster-based approach with hierarchical communications across clusters, piggybacking techniques for sending messages across clu…
▽ More
In this paper, we present BunchBFT Byzantine fault-tolerant state-machine replication for high performance and scalability. At the heart of BunchBFT is a novel design called the cluster-based approach that divides the replicas into clusters of replicas. By combining this cluster-based approach with hierarchical communications across clusters, piggybacking techniques for sending messages across clusters, and decentralized leader election for each cluster, BunchBFT achieves high performance and scalability.
We also prove that BunchBFT satisfies the basic safety and liveness properties of Byzantine consensus. We implemented a prototype of BunchBFT in our PaxiBFT framework to show that the BunchBFT can improve the MirBFT's throughput by 10x, depending on the available bandwidth on wide-area links.
△ Less
Submitted 21 May, 2022;
originally announced May 2022.
-
BigBFT: A Multileader Byzantine Fault Tolerance Protocol for High Throughput
Authors:
Salem Alqahtani,
Murat Demirbas
Abstract:
This paper describes BigBFT, a multi-leader Byzantine fault tolerance protocol that achieves high throughput and scalable consensus in blockchain systems. BigBFT achieves this by (1) enabling every node to be a leader that can propose and order the blocks in parallel, (2) piggybacking votes within rounds, (3) pipelining blocks across rounds, and (4) using only two communication steps to order bloc…
▽ More
This paper describes BigBFT, a multi-leader Byzantine fault tolerance protocol that achieves high throughput and scalable consensus in blockchain systems. BigBFT achieves this by (1) enabling every node to be a leader that can propose and order the blocks in parallel, (2) piggybacking votes within rounds, (3) pipelining blocks across rounds, and (4) using only two communication steps to order blocks in the common case.
BigBFT has an amortized communication cost of $O(n)$ over $n$ requests. We evaluate BigBFT's performance both analytically, using back-of-the-envelope load formulas to construct a cost analysis, and also empirically by implementing it in our PaxiBFT framework. Our evaluation compares BigBFT with PBFT, Tendermint, Streamlet, and Hotstuff under various workloads using deployments of 4 to 20 nodes. Our results show that BigBFT outperforms PBFT, Tendermint, Streamlet, and Hotstuff protocols either in terms of latency (by up to $70\%$) or in terms of throughput (by up to $190\%$).
△ Less
Submitted 12 October, 2021; v1 submitted 26 September, 2021;
originally announced September 2021.
-
Bottlenecks in Blockchain Consensus Protocols
Authors:
Salem Alqahtani,
Murat Demirbas
Abstract:
Most of the Blockchain permissioned systems employ Byzantine fault-tolerance (BFT) consensus protocols to ensure that honest validators agree on the order for appending entries to their ledgers. In this paper, we study the performance and the scalability of prominent consensus protocols, namely PBFT, Tendermint, HotStuff, and Streamlet, both analytically via load formulas and practically via imple…
▽ More
Most of the Blockchain permissioned systems employ Byzantine fault-tolerance (BFT) consensus protocols to ensure that honest validators agree on the order for appending entries to their ledgers. In this paper, we study the performance and the scalability of prominent consensus protocols, namely PBFT, Tendermint, HotStuff, and Streamlet, both analytically via load formulas and practically via implementation and evaluation. Under identical conditions, we identify the bottlenecks of these consensus protocols and show that these protocols do not scale well as the number of validators increases. Our investigation points to the communication complexity as the culprit. Even when there is enough network bandwidth, the CPU cost of serialization and deserialization of the messages limits the throughput and increases the latency of the protocols. To alleviate the bottlenecks, the most useful techniques include reducing the communication complexity, rotating the hotspot of communications, and pipelining across consensus instances.
△ Less
Submitted 11 October, 2021; v1 submitted 6 March, 2021;
originally announced March 2021.
-
Scaling Replicated State Machines with Compartmentalization [Technical Report]
Authors:
Michael Whittaker,
Ailidani Ailijiang,
Aleksey Charapko,
Murat Demirbas,
Neil Giridharan,
Joseph M. Hellerstein,
Heidi Howard,
Ion Stoica,
Adriana Szekeres
Abstract:
State machine replication protocols, like MultiPaxos and Raft, are a critical component of many distributed systems and databases. However, these protocols offer relatively low throughput due to several bottlenecked components. Numerous existing protocols fix different bottlenecks in isolation but fall short of a complete solution. When you fix one bottleneck, another arises. In this paper, we int…
▽ More
State machine replication protocols, like MultiPaxos and Raft, are a critical component of many distributed systems and databases. However, these protocols offer relatively low throughput due to several bottlenecked components. Numerous existing protocols fix different bottlenecks in isolation but fall short of a complete solution. When you fix one bottleneck, another arises. In this paper, we introduce compartmentalization, the first comprehensive technique to eliminate state machine replication bottlenecks. Compartmentalization involves decoupling individual bottlenecks into distinct components and scaling these components independently. Compartmentalization has two key strengths. First, compartmentalization leads to strong performance. In this paper, we demonstrate how to compartmentalize MultiPaxos to increase its throughput by 6x on a write-only workload and 16x on a mixed read-write workload. Unlike other approaches, we achieve this performance without the need for specialized hardware. Second, compartmentalization is a technique, not a protocol. Industry practitioners can apply compartmentalization to their protocols incrementally without having to adopt a completely new protocol.
△ Less
Submitted 16 May, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Scaling Strongly Consistent Replication
Authors:
Aleksey Charapko,
Ailidani Ailijiang,
Murat Demirbas
Abstract:
Strong consistency replication helps keep application logic simple and provides significant benefits for correctness and manageability. Unfortunately, the adoption of strongly-consistent replication protocols has been curbed due to their limited scalability and performance. To alleviate the leader bottleneck in strongly-consistent replication protocols, we introduce Pig, an in-protocol communicati…
▽ More
Strong consistency replication helps keep application logic simple and provides significant benefits for correctness and manageability. Unfortunately, the adoption of strongly-consistent replication protocols has been curbed due to their limited scalability and performance. To alleviate the leader bottleneck in strongly-consistent replication protocols, we introduce Pig, an in-protocol communication aggregation and piggybacking technique. Pig employs randomly selected nodes from follower subgroups to relay the leader's message to the rest of the followers in the subgroup, and to perform in-network aggregation of acknowledgments back from these followers. By randomly alternating the relay nodes across replication operations, Pig shields the relay nodes as well as the leader from becoming hotspots and improves throughput scalability.
We showcase Pig in the context of classical Paxos protocols employed for strongly consistent replication by many cloud computing services and databases. We implement and evaluate PigPaxos, in comparison to Paxos and EPaxos protocols under various workloads over clusters of size 5 to 25 nodes. We show that the aggregation at the relay has little latency overhead, and PigPaxos can provide more than 3 folds improved throughput over Paxos and EPaxos with little latency deterioration. We support our experimental observations with the analytical modeling of the bottlenecks and show that the rotating of the relay nodes provides the most benefit for reducing the bottlenecks and that the throughput is maximized when employing only 1 randomly rotating relay node.
△ Less
Submitted 21 January, 2021; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Performance Analysis and Comparison of Distributed Machine Learning Systems
Authors:
Salem Alqahtani,
Murat Demirbas
Abstract:
Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this probl…
▽ More
Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures of the systems via experiments performed with Tensorflow and Horovod frameworks. We found that the system architecture has a very significant effect on the performance of training. RA-based systems achieve scalable performance as they successfully decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server side. While P2P systems fare better than 1PS systems, they still suffer from significant network bottleneck. Finally, RA systems also excel by virtue of overlapping computation time and communication time, which PS and P2P architectures fail to achieve.
△ Less
Submitted 4 September, 2019;
originally announced September 2019.
-
Using Weaker Consistency Models with Monitoring and Recovery for Improving Performance of Key-Value Stores
Authors:
Duong Nguyen,
Aleksey Charapko,
Sandeep S Kulkarni,
Murat Demirbas
Abstract:
Consistency properties provided by most key-value stores can be classified into sequential consistency and eventual consistency. The former is easier to program with but suffers from lower performance whereas the latter suffers from potential anomalies while providing higher performance. We focus on the problem of what a designer should do if he/she has an algorithm that works correctly with seque…
▽ More
Consistency properties provided by most key-value stores can be classified into sequential consistency and eventual consistency. The former is easier to program with but suffers from lower performance whereas the latter suffers from potential anomalies while providing higher performance. We focus on the problem of what a designer should do if he/she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a weaker consistency. We propose a detect-rollback based approach: The designer identifies a correctness predicate, say $P$, and continues to run the protocol, as our system monitors $P$. If $P$ is violated (because of weaker consistency), the system rolls back and resumes the computation at a state where $P$ holds.
We evaluate this approach with graph-based applications running on the Voldemort key-value store. Our experiments with deployment on Amazon AWS EC2 instances shows that using eventual consistency with monitoring can provide a $50\%$ -- $80\%$ increase in throughput when compared with sequential consistency. We also observe that the overhead of the monitoring itself was low (typically less than $4\%$) and the latency of detecting violations was small. In particular, in a scenario designed to intentionally cause a large number of violations, more than $99.9\%$ of violations were detected in less than 50 milliseconds in regional networks, and in less than 3 seconds in global networks.
We find that for some applications, frequent rollback can cause the program using eventual consistency to effectively \textit{stall}. We propose alternate mechanisms for dealing with re-occurring rollbacks. Overall, for applications considered in this paper, we find that even with rollback, eventual consistency provides better performance than using sequential consistency.
△ Less
Submitted 3 September, 2019;
originally announced September 2019.
-
CausalSpartanX: Causal Consistency and Non-Blocking Read-Only Transactions
Authors:
Mohammad Roohitavaf,
Murat Demirbas,
Sandeep Kulkarni
Abstract:
Causal consistency is an intermediate consistency model that can be achieved together with high availability and performance requirements even in presence of network partitions. In the context of partitioned data stores, it has been shown that implicit dependency tracking using timestamps is more efficient than explicit dependency tracking. Existing time-based solutions depend on monotonic psychic…
▽ More
Causal consistency is an intermediate consistency model that can be achieved together with high availability and performance requirements even in presence of network partitions. In the context of partitioned data stores, it has been shown that implicit dependency tracking using timestamps is more efficient than explicit dependency tracking. Existing time-based solutions depend on monotonic psychical clocks that are closely synchronized. These requirements make current protocols vulnerable to clock anomalies. In this paper, we propose a new time-based algorithm, CausalSpartanX, that instead of physical clocks, utilizes Hybrid Logical Clocks (HLCs). We show that using HLCs, without any overhead, we make the system robust on physical clock anomalies. This improvement is more significant in the context of query amplification, where a single query results in multiple GET/PUT operations. We also show that CausalSpartanX decreases the visibility latency for a given data item compared with existing time-based approaches. In turn, this reduces the completion time of collaborative applications where two clients accessing two different replicas edit same items of the data store. CausalSpartanX also provides causally consistent distributed read-only transactions. CausalSpartanX read-only transactions are non-blocking and require only one round of communication between the client and the servers. Also, the slowdowns of partitions that are unrelated to a transaction do not affect the performance of the transaction. Like previous protocols, CausalSpartanX assumes that a given client does not access more than one replica. We show that in presence of network partitions, this assumption (made in several other works) is essential if one were to provide causal consistency as well as immediate availability to local updates.
△ Less
Submitted 17 December, 2018;
originally announced December 2018.
-
Does The Cloud Need Stabilizing?
Authors:
Murat Demirbas,
Aleksey Charapko,
Ailidani Ailijiang
Abstract:
The last decade has witnessed rapid proliferation of cloud computing. While even the smallest distributed programs (with 3-5 actions) produce many unanticipated error cases due to concurrency involved, it seems short of a miracle these web-services are able to operate at those vast scales. In this paper, we explore the factors that contribute most to the high-availability of cloud computing servic…
▽ More
The last decade has witnessed rapid proliferation of cloud computing. While even the smallest distributed programs (with 3-5 actions) produce many unanticipated error cases due to concurrency involved, it seems short of a miracle these web-services are able to operate at those vast scales. In this paper, we explore the factors that contribute most to the high-availability of cloud computing services and examine where self-stabilization could fit in that picture.
△ Less
Submitted 8 June, 2018;
originally announced June 2018.
-
Technical Report: Optimistic Execution in Key-Value Store
Authors:
Duong Nguyen,
Aleksey Charapko,
Sandeep Kulkarni,
Murat Demirbas
Abstract:
Limitations of the CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if he/she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provide…
▽ More
Limitations of the CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if he/she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a weaker (e.g., eventual or causal) consistency. We propose a detect-rollback based approach: The designer identifies a correctness predicate, say $P$, and continues to run the protocol, as our system monitors $P$. If $P$ is violated (because the underlying key-value store provides a weaker consistency), the system rolls back and resumes the computation at a state where $P$ holds.
We evaluate this approach with practical graph applications running on the Voldemort key-value store. Our experiments with deployment on Amazon AWS EC2 instances shows that using eventual consistency with monitoring can provide a $50-80\%$ increase in throughput when compared with sequential consistency. We also show that the overhead of the monitoring itself is low (typically less than 4\%) and the latency of detecting violations is small. In particular, more than $99.9\%$ of violations are detected in less than $50$ milliseconds in regional AWS networks, and in less than $5$ seconds in global AWS networks.
△ Less
Submitted 23 June, 2018; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Optimistic Execution in Key-Value Store
Authors:
Duong Nguyen,
Aleksey Charapko,
Sandeep Kulkarni,
Murat Demirbas
Abstract:
Limitations of CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a wea…
▽ More
Limitations of CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a weaker (e.g., eventual or causal) consistency. We propose a detect-rollback based approach: The designer identifies a correctness predicate, say P , and continue to run the protocol, as our system monitors P . If P is violated (because the underlying key-value store provides a weaker consistency), the system rolls back and resumes the computation at a state where P holds.
We evaluate this approach in the Voldemort key-value store. Our experiments with deployment of Voldemort on Amazon AWS shows that using eventual consistency with monitoring can provide 20 - 40% increase in throughput when compared with sequential consistency. We also show that the overhead of the monitor itself is small (typically less than 8%) and the latency of detecting violations is very low. For example, more than 99.9% violations are detected in less than 1 second.
△ Less
Submitted 22 January, 2018;
originally announced January 2018.
-
Monitoring Partially Synchronous Distributed Systems using SMT Solvers
Authors:
Vidhya Tekken Valapil,
Sorrachai Yingchareonthawornchai,
Sandeep Kulkarni,
Eric Torng,
Murat Demirbas
Abstract:
In this paper, we discuss the feasibility of monitoring partially synchronous distributed systems to detect latent bugs, i.e., errors caused by concurrency and race conditions among concurrent processes. We present a monitoring framework where we model both system constraints and latent bugs as Satisfiability Modulo Theories (SMT) formulas, and we detect the presence of latent bugs using an SMT so…
▽ More
In this paper, we discuss the feasibility of monitoring partially synchronous distributed systems to detect latent bugs, i.e., errors caused by concurrency and race conditions among concurrent processes. We present a monitoring framework where we model both system constraints and latent bugs as Satisfiability Modulo Theories (SMT) formulas, and we detect the presence of latent bugs using an SMT solver. We demonstrate the feasibility of our framework using both synthetic applications where latent bugs occur at any time with random probability and an application involving exclusive access to a shared resource with a subtle timing bug. We illustrate how the time required for verification is affected by parameters such as communication frequency, latency, and clock skew. Our results show that our framework can be used for real-life applications, and because our framework uses SMT solvers, the range of appropriate applications will increase as these solvers become more efficient over time.
△ Less
Submitted 24 July, 2017;
originally announced July 2017.
-
WPaxos: Wide Area Network Flexible Consensus
Authors:
Ailidani Ailijiang,
Aleksey Charapko,
Murat Demirbas,
Tevfik Kosar
Abstract:
WPaxos is a multileader Paxos protocol that provides low-latency and high-throughput consensus across wide-area network (WAN) deployments. WPaxos uses multileaders, and partitions the object-space among these multileaders. Unlike statically partitioned multiple Paxos deployments, WPaxos is able to adapt to the changing access locality through object stealing. Multiple concurrent leaders coinciding…
▽ More
WPaxos is a multileader Paxos protocol that provides low-latency and high-throughput consensus across wide-area network (WAN) deployments. WPaxos uses multileaders, and partitions the object-space among these multileaders. Unlike statically partitioned multiple Paxos deployments, WPaxos is able to adapt to the changing access locality through object stealing. Multiple concurrent leaders coinciding in different zones steal ownership of objects from each other using phase-1 of Paxos, and then use phase-2 to commit update-requests on these objects locally until they are stolen by other leaders. To achieve fast phase-2 commits, WPaxos adopts the flexible quorums idea in a novel manner, and appoints phase-2 acceptors to be close to their respective leaders. We implemented WPaxos and evaluated it on WAN deployments across 5 AWS regions. The dynamic partitioning of the object-space and emphasis on zone-local commits allow WPaxos to significantly outperform both partitioned Paxos deployments and leaderless Paxos approaches.
△ Less
Submitted 3 April, 2019; v1 submitted 26 March, 2017;
originally announced March 2017.
-
Precision, Recall, and Sensitivity of Monitoring Partially Synchronous Distributed Systems
Authors:
Sorrachai Yingchareonthawornchai,
Duong Nguyen,
Vidhya Tekken Valapil,
Sandeep Kulkarni,
Murat Demirbas
Abstract:
Runtime verification focuses on analyzing the execution of a given program by a monitor to determine if it is likely to violate its specifications. There is often an impedance mismatch between the assumptions/model of the monitor and that of the underlying program. This constitutes problems especially for distributed systems, where the concept of current time and state are inherently uncertain. A…
▽ More
Runtime verification focuses on analyzing the execution of a given program by a monitor to determine if it is likely to violate its specifications. There is often an impedance mismatch between the assumptions/model of the monitor and that of the underlying program. This constitutes problems especially for distributed systems, where the concept of current time and state are inherently uncertain. A monitor designed with asynchronous system model assumptions may cause false-positives for a program executing in a partially synchronous system: the monitor may flag a global predicate that does not actually occur in the underlying system. A monitor designed with a partially synchronous system model assumption may cause false negatives as well as false positives for a program executing in an environment where the bounds on partial synchrony differ (albeit temporarily) from the monitor model assumptions.
In this paper we analyze the effects of the impedance mismatch between the monitor and the underlying program for the detection of conjunctive predicates. We find that there is a small interval where the monitor assumptions are hypersensitive to the underlying program environment. We provide analytical derivations for this interval, and also provide simulation support for exploring the sensitivity of predicate detection to the impedance mismatch between the monitor and the program under a partially synchronous system.
△ Less
Submitted 3 July, 2016;
originally announced July 2016.