-
Spider: A BFT Architecture for Geo-Replicated Cloud Services
Authors:
Michael Eischer,
Tobias Distler
Abstract:
Traditionally, Byzantine fault tolerance (BFT) in geo-replicated systems is achieved by executing complex agreement protocols over long-distance communication links, and therefore typically incurs high response times. In this paper, we address this problem with Spider, a resilient and modular BFT replication architecture for geo-distributed systems that leverages characteristic features of today's public-cloud infrastructures to minimize both complexity and latency. Spider is composed of multiple largely independent replica groups, each of which is distributed across different availability zones of its respective cloud region. This design makes it possible to provide low response times by placing replica groups geographically close to clients, while at the same time enabling intra-group communication over short-distance links. To handle the interaction between groups that is necessary for strong consistency, Spider uses a novel message-channel abstraction with first-in-first-out semantics and built-in flow control that greatly simplifies system design.
Submitted 18 May, 2024;
originally announced July 2024.
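As a rough illustration of what "first-in-first-out semantics and built-in flow control" can mean for a message channel, the following is a minimal single-process sketch. It is not Spider's group-to-group channel: the class name `BoundedFifoChannel`, its methods, and the fixed-capacity backpressure scheme are all assumptions made for this example, and the sketch ignores replication, Byzantine faults, and networking entirely.

```python
from collections import deque


class BoundedFifoChannel:
    """Toy FIFO channel with capacity-based flow control (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._queue = deque()

    def try_send(self, message) -> bool:
        """Enqueue a message; return False if the channel is full.

        Refusing the message is the flow-control signal: the sender has
        to wait until the receiver has drained part of the channel.
        """
        if len(self._queue) >= self._capacity:
            return False
        self._queue.append(message)
        return True

    def receive(self):
        """Return the oldest pending message, or None if there is none."""
        if not self._queue:
            return None
        return self._queue.popleft()
```

A sender that gets `False` from `try_send` simply retries later, so a slow receiver bounds how far ahead a fast sender can run, while `receive` always hands out messages in the order they were accepted.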
-
Egalitarian Byzantine Fault Tolerance
Authors:
Michael Eischer,
Tobias Distler
Abstract:
Minimizing end-to-end latency in geo-replicated systems usually makes it necessary to compromise on resilience, resource efficiency, or throughput, because existing approaches either tolerate only crashes, require additional replicas, or rely on a global leader for consensus. In this paper, we eliminate the need for such tradeoffs by presenting Isos, a leaderless replication protocol that tolerates up to $f$ Byzantine faults with a minimum of $3f+1$ replicas. To reduce latency in wide-area environments, Isos relies on an efficient consensus algorithm that allows all participating replicas to propose new requests and thereby enables clients to avoid delays by submitting requests to their nearest replica. In addition, Isos minimizes overhead by limiting message ordering to requests that conflict with each other (e.g., because they access the same parts of the state) and by committing them after only three communication steps if at least $f+1$ replicas report each conflict. Our experimental evaluation with a geo-replicated key-value store shows that these properties allow Isos to provide lower end-to-end latency than existing protocols, especially in scenarios in which the clients of a system are distributed across multiple locations.
Submitted 14 September, 2021;
originally announced September 2021.
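The two thresholds named in the Isos abstract, $3f+1$ replicas and $f+1$ conflict reports, can be made concrete with a little arithmetic. The sketch below is only that: the function names are hypothetical, and the `f + 1` check mirrors the condition as the abstract words it, not the full set of conditions the protocol itself may impose.

```python
def min_replicas(f: int) -> int:
    """Smallest group size that tolerates f Byzantine faults (3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f such that n replicas still satisfy n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def conflict_sufficiently_reported(reports: int, f: int) -> bool:
    """True if at least f + 1 replicas reported the conflict.

    With at most f faulty replicas, any set of f + 1 reporters
    contains at least one correct replica.
    """
    return reports >= f + 1
```

For example, tolerating one Byzantine fault requires four replicas, and in that setting a conflict counts as sufficiently reported once two replicas have reported it. Adding a fifth or sixth replica does not raise the number of tolerated faults; that takes seven.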
-
Resilient Cloud-based Replication with Low Latency
Authors:
Michael Eischer,
Tobias Distler
Abstract:
Existing approaches to tolerating Byzantine faults in geo-replicated environments require systems to execute complex agreement protocols over wide-area links and consequently are often associated with high response times. In this paper, we address this problem with Spider, a resilient replication architecture for geo-distributed systems that leverages the availability characteristics of today's public-cloud infrastructures to minimize complexity and reduce latency. Spider models a system as a collection of loosely coupled replica groups whose members are hosted in different cloud-provided fault domains (i.e., availability zones) of the same geographic region. This structural organization makes it possible to achieve low response times by placing replica groups in close proximity to clients while still enabling the replicas of a group to interact over short-distance links. To handle the inter-group communication necessary for strong consistency, Spider uses a reliable group-to-group message channel with first-in-first-out semantics and built-in flow control that significantly simplifies system design.
Submitted 21 September, 2020;
originally announced September 2020.