Search | arXiv e-print repository

Byzantine Agreement with Predictions

Authors: Naama Ben-David, Muhammad Ayaz Dzulfikar, Faith Ellen, Seth Gilbert

Abstract: In this paper, we study the problem of \emph{Byzantine Agreement with predictions}. Along with a proposal, each process is also given a prediction, i.e., extra information which is not guaranteed to be true. For example, one might imagine that the prediction is produced by a network security monitoring service that looks for patterns of malicious behavior. Our goal is to design an algorithm that… ▽ More In this paper, we study the problem of \emph{Byzantine Agreement with predictions}. Along with a proposal, each process is also given a prediction, i.e., extra information which is not guaranteed to be true. For example, one might imagine that the prediction is produced by a network security monitoring service that looks for patterns of malicious behavior. Our goal is to design an algorithm that is more efficient when the predictions are accurate, degrades in performance as predictions decrease in accuracy, and still in the worst case performs as well as any algorithm without predictions even when the predictions are completely inaccurate. On the negative side, we show that Byzantine Agreement with predictions still requires $Ω(n^2)$ messages, even in executions where the predictions are completely accurate. On the positive side, we show that \emph{classification predictions} can help improve the time complexity. For (synchronous) Byzantine Agreement with classification predictions, we present new algorithms that leverage predictions to yield better time complexity, and we show that the time complexity achieved is optimal as a function of the prediction quality. △ Less

Submitted 3 May, 2025; originally announced May 2025.

arXiv:2308.03919 [pdf, other]

The FIDS Theorems: Tensions between Multinode and Multicore Performance in Transactional Systems

Authors: Naama Ben-David, Gal Sela, Adriana Szekeres

Abstract: Traditionally, distributed and parallel transactional systems have been studied in isolation, as they targeted different applications and experienced different bottlenecks. However, modern high-bandwidth networks have made the study of systems that are both distributed (i.e., employ multiple nodes) and parallel (i.e., employ multiple cores per node) necessary to truly make use of the available har… ▽ More Traditionally, distributed and parallel transactional systems have been studied in isolation, as they targeted different applications and experienced different bottlenecks. However, modern high-bandwidth networks have made the study of systems that are both distributed (i.e., employ multiple nodes) and parallel (i.e., employ multiple cores per node) necessary to truly make use of the available hardware. In this paper, we study the performance of these combined systems and show that there are inherent tradeoffs between a system's ability to have fast and robust distributed communication and its ability to scale to multiple cores. More precisely, we formalize the notions of a \emph{fast deciding} path of communication to commit transactions quickly in good executions, and \emph{seamless fault tolerance} that allows systems to remain robust to server failures. We then show that there is an inherent tension between these two natural distributed properties and well-known multicore scalability properties in transactional systems. Finally, we show positive results; it is possible to construct a parallel distributed transactional system if any one of the properties we study is removed. △ Less

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2210.17174 [pdf, other]

uBFT: Microsecond-scale BFT using Disaggregated Memory [Extended Version]

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Antoine Murat, Athanasios Xygkis, Igor Zablotchi

Abstract: We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tail… ▽ More We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tailored trusted computing base -- disaggregated memory -- and consumes a practically bounded amount of memory. uBFT is based on a novel abstraction called Consistent Tail Broadcast, which we use to prevent equivocation while bounding memory. We implement uBFT using RDMA-based disaggregated memory and obtain an end-to-end latency of as little as 10us. This is at least 50$\times$ faster than MinBFT , a state of the art $2f{+}1$ BFT SMR based on Intel's SGX. We use uBFT to replicate two KV-stores (Memcached and Redis), as well as a financial order matching engine (Liquibook). These applications have low latency (up to 20us) and become Byzantine tolerant with as little as 10us more. The price for uBFT is a small amount of reliable disaggregated memory (less than 1 MiB), which in our prototype consists of a small number of memory servers connected through RDMA and replicated for fault tolerance. △ Less

Submitted 16 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

arXiv:2208.11114 [pdf, other]

Survey of Persistent Memory Correctness Conditions

Authors: Naama Ben-David, Michal Friedman, Yuanhao Wei

Abstract: The study of concurrent persistent programs has seen a surge of activity in recent years due to the introduction of non-volatile random access memories (NVRAM), yielding many models and correctness notions that are difficult to compare. In this paper, we survey existing correctness properties for this setting, placing them into the same context and comparing them. We present a hierarchy of these p… ▽ More The study of concurrent persistent programs has seen a surge of activity in recent years due to the introduction of non-volatile random access memories (NVRAM), yielding many models and correctness notions that are difficult to compare. In this paper, we survey existing correctness properties for this setting, placing them into the same context and comparing them. We present a hierarchy of these persistence properties based on the generality of the histories they deem correct, and show how this hierarchy shifts based on different model assumptions. △ Less

Submitted 23 August, 2022; originally announced August 2022.

arXiv:2201.00813 [pdf, other]

Lock-Free Locks Revisited

Authors: Naama Ben-David, Guy E. Blelloch, Yuanhao Wei

Abstract: This paper presents a new and practical approach to lock-free locks based on helping, which allows the user to write code using fine-grained locks, but run it in a lock-free manner. Although lock-free locks have been suggested in the past, they are widely viewed as impractical, have some key limitations, and, as far as we know, have never been implemented. The paper presents some key techniques… ▽ More This paper presents a new and practical approach to lock-free locks based on helping, which allows the user to write code using fine-grained locks, but run it in a lock-free manner. Although lock-free locks have been suggested in the past, they are widely viewed as impractical, have some key limitations, and, as far as we know, have never been implemented. The paper presents some key techniques that make lock-free locks practical and more general. The most important technique is an approach to idempotence -- i.e. making code that runs multiple times appear as if it ran once. The idea is based on using a shared log among processes running the same protected code. Importantly, the approach can be library based, requiring very little if any change to standard code -- code just needs to use the idempotent versions of memory operations (load, store, LL/SC, allocation, free). We have implemented a C++ library called Flock based on the ideas. Flock allows lock-based data structures to run in either lock-free or blocking (traditional locks) mode. We implemented a variety of tree and list-based data structures with Flock and compare the performance of the lock-free and blocking modes under a variety of workloads. The lock-free mode is almost as fast as blocking mode under almost all workloads, and significantly faster when threads are oversubscribed (more threads than processors). We also compare with several existing lock-based and lock-free alternatives. △ Less

Submitted 28 January, 2022; v1 submitted 3 January, 2022; originally announced January 2022.

Comments: This is the full version of the paper appearing in PPoPP 2022

arXiv:2112.04197 [pdf, ps, other]

A Fast Algorithm for PAC Combinatorial Pure Exploration

Authors: Noa Ben-David, Sivan Sabato

Abstract: We consider the problem of Combinatorial Pure Exploration (CPE), which deals with finding a combinatorial set or arms with a high reward, when the rewards of individual arms are unknown in advance and must be estimated using arm pulls. Previous algorithms for this problem, while obtaining sample complexity reductions in many cases, are highly computationally intensive, thus making them impractical… ▽ More We consider the problem of Combinatorial Pure Exploration (CPE), which deals with finding a combinatorial set or arms with a high reward, when the rewards of individual arms are unknown in advance and must be estimated using arm pulls. Previous algorithms for this problem, while obtaining sample complexity reductions in many cases, are highly computationally intensive, thus making them impractical even for mildly large problems. In this work, we propose a new CPE algorithm in the PAC setting, which is computationally light weight, and so can easily be applied to problems with tens of thousands of arms. This is achieved since the proposed algorithm requires a very small number of combinatorial oracle calls. The algorithm is based on successive acceptance of arms, along with elimination which is based on the combinatorial structure of the problem. We provide sample complexity guarantees for our algorithm, and demonstrate in experiments its usefulness on large problems, whereas previous algorithms are impractical to run on problems of even a few dozen arms. The code for the algorithms and experiments is provided at https://github.com/noabdavid/csale. △ Less

Submitted 8 December, 2021; originally announced December 2021.

Comments: Full version of paper accepted to AAAI-22

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 36(6), 6064--6071, 2022

arXiv:2108.04520 [pdf, ps, other]

Fast and Fair Randomized Wait-Free Locks

Authors: Naama Ben-David, Guy E. Blelloch

Abstract: We present a randomized approach for wait-free locks with strong bounds on time and fairness in a context in which any process can be arbitrarily delayed. Our approach supports a tryLock operation that is given a set of locks, and code to run when all the locks are acquired. A tryLock operation, or attempt, may fail if there is contention on the locks, in which case the code is not run. Given an u… ▽ More We present a randomized approach for wait-free locks with strong bounds on time and fairness in a context in which any process can be arbitrarily delayed. Our approach supports a tryLock operation that is given a set of locks, and code to run when all the locks are acquired. A tryLock operation, or attempt, may fail if there is contention on the locks, in which case the code is not run. Given an upper bound $κ$ known to the algorithm on the point contention of any lock, and an upper bound $L$ on the number of locks in a tryLock's set, a tryLock will succeed in acquiring its locks and running the code with probability at least $1/(κL)$. It is thus fair. Furthermore, if the maximum step complexity for the code in any lock is $T$, the attempt will take $O(κ^2 L^2 T)$ steps, regardless of whether it succeeds or fails. The attempts are independent, thus if the tryLock is repeatedly retried on failure, it will succeed in $O(κ^3 L^3 T)$ expected steps, and with high probability in not much more. △ Less

Submitted 28 October, 2022; v1 submitted 10 August, 2021; originally announced August 2021.

arXiv:2108.04202 [pdf, other]

FliT: A Library for Simple and Efficient Persistent Algorithms

Authors: Yuanhao Wei, Naama Ben-David, Michal Friedman, Guy E. Blelloch, Erez Petrank

Abstract: Non-volatile random access memory (NVRAM) offers byte-addressable persistence at speeds comparable to DRAM. However, with caches remaining volatile, automatic cache evictions can reorder updates to memory, potentially leaving persistent memory in an inconsistent state upon a system crash. Flush and fence instructions can be used to force ordering among updates, but are expensive. This has motivate… ▽ More Non-volatile random access memory (NVRAM) offers byte-addressable persistence at speeds comparable to DRAM. However, with caches remaining volatile, automatic cache evictions can reorder updates to memory, potentially leaving persistent memory in an inconsistent state upon a system crash. Flush and fence instructions can be used to force ordering among updates, but are expensive. This has motivated significant work studying how to write correct and efficient persistent programs for NVRAM. In this paper, we present FliT, a C++ library that facilitates writing efficient persistent code. Using the library's default mode makes any linearizable data structure durable with minimal changes to the code. FliT avoids many redundant flush instructions by using a novel algorithm to track dirty cache lines. The FliT library also allows for extra optimizations, but achieves good performance even in its default setting. To describe the FliT library's capabilities and guarantees, we define a persistent programming interface, called the P-V Interface, which FliT implements. The P-V Interface captures the expected behavior of code in which some instructions' effects are persisted and some are not. We show that the interface captures the desired semantics of many practical algorithms in the literature. We apply the FliT library to four different persistent data structures, and show that across several workloads, persistence implementations, and data structure sizes, the FliT library always improves operation throughput, by at least $2.1\times$ over a naive implementation in all but one workload. △ Less

Submitted 18 August, 2021; v1 submitted 9 August, 2021; originally announced August 2021.

arXiv:2108.02775 [pdf, other]

doi 10.4230/LIPIcs.DISC.2021.12

Space and Time Bounded Multiversion Garbage Collection

Authors: Naama Ben-David, Guy E. Blelloch, Panagiota Fatourou, Eric Ruppert, Yihan Sun, Yuanhao Wei

Abstract: We present a general technique for garbage collecting old versions for multiversion concurrency control that simultaneously achieves good time and space complexity. Our technique takes only $O(1)$ time on average to reclaim each version and maintains only a constant factor more versions than needed (plus an additive term). It is designed for multiversion schemes using version lists, which are the… ▽ More We present a general technique for garbage collecting old versions for multiversion concurrency control that simultaneously achieves good time and space complexity. Our technique takes only $O(1)$ time on average to reclaim each version and maintains only a constant factor more versions than needed (plus an additive term). It is designed for multiversion schemes using version lists, which are the most common. Our approach uses two components that are of independent interest. First, we define a novel range-tracking data structure which stores a set of old versions and efficiently finds those that are no longer needed. We provide a wait-free implementation in which all operations take amortized constant time. Second, we represent version lists using a new lock-free doubly-linked list algorithm that supports efficient (amortized constant time) removals given a pointer to any node in the list. These two components naturally fit together to solve the multiversion garbage collection problem--the range-tracker identifies which versions to remove and our list algorithm can then be used to remove them from their version lists. We apply our garbage collection technique to generate end-to-end time and space bounds for the multiversioning system of Wei et al. (PPoPP 2021). △ Less

Submitted 16 December, 2021; v1 submitted 5 August, 2021; originally announced August 2021.

Comments: This is the full version of the paper appearing in the International Symposium on Distributed Computing (DISC 2021)

arXiv:2108.01330 [pdf, other]

Frugal Byzantine Computing

Authors: M. K. Aguilera, N. Ben-David, R. Guerraoui, D. Papuc, A. Xygkis, I. Zablotchi

Abstract: Traditional techniques for handling Byzantine failures are expensive: digital signatures are too costly, while using $3f{+}1$ replicas is uneconomical ($f$ denotes the maximum number of Byzantine processes). We seek algorithms that reduce the number of replicas to $2f{+}1$ and minimize the number of signatures. While the first goal can be achieved in the message-and-memory model, accomplishing the… ▽ More Traditional techniques for handling Byzantine failures are expensive: digital signatures are too costly, while using $3f{+}1$ replicas is uneconomical ($f$ denotes the maximum number of Byzantine processes). We seek algorithms that reduce the number of replicas to $2f{+}1$ and minimize the number of signatures. While the first goal can be achieved in the message-and-memory model, accomplishing the second goal simultaneously is challenging. We first address this challenge for the problem of broadcasting messages reliably. We consider two variants of this problem, Consistent Broadcast and Reliable Broadcast, typically considered very close. Perhaps surprisingly, we establish a separation between them in terms of signatures required. In particular, we show that Consistent Broadcast requires at least 1 signature in some execution, while Reliable Broadcast requires $O(n)$ signatures in some execution. We present matching upper bounds for both primitives within constant factors. We then turn to the problem of consensus and argue that this separation matters for solving consensus with Byzantine failures: we present a practical consensus algorithm that uses Consistent Broadcast as its main communication primitive. This algorithm works for $n=2f{+}1$ and avoids signatures in the common-case -- properties that have not been simultaneously achieved previously. Overall, our work approaches Byzantine computing in a frugal manner and motivates the use of Consistent Broadcast -- rather than Reliable Broadcast -- as a key primitive for reaching agreement. △ Less

Submitted 3 August, 2021; originally announced August 2021.

Comments: This paper is an extended version of the DISC 2021 paper

arXiv:2105.10566 [pdf, other]

Classifying Trusted Hardware via Unidirectional Communication

Authors: Naama Ben-David, Kartik Nayak

Abstract: It is well known that Byzantine fault tolerant (BFT) consensus cannot be solved in the classic asynchronous message passing model when one-third or more of the processes may be faulty. Since many modern applications require higher fault tolerance, this bound has been circumvented by introducing non-equivocation mechanisms that prevent Byzantine processes from sending conflicting messages to other… ▽ More It is well known that Byzantine fault tolerant (BFT) consensus cannot be solved in the classic asynchronous message passing model when one-third or more of the processes may be faulty. Since many modern applications require higher fault tolerance, this bound has been circumvented by introducing non-equivocation mechanisms that prevent Byzantine processes from sending conflicting messages to other processes. The use of trusted hardware is a way to implement non-equivocation. Several different trusted hardware modules have been considered in the literature. In this paper, we study whether all trusted hardware modules are equivalent in the power that they provide to a system. We show that while they do all prevent equivocation, we can partition trusted hardware modules into two different power classes; those that employ shared memory primitives, and those that do not. We separate these classes using a new notion we call unidirectionality, which describes a useful guarantee on the ability of processes to prevent network partitions. We show that shared-memory based hardware primitives provide unidirectionality, while others do not. △ Less

Submitted 21 May, 2021; originally announced May 2021.

arXiv:2103.13796 [pdf, other]

Active Structure Learning of Bayesian Networks in an Observational Setting

Authors: Noa Ben-David, Sivan Sabato

Abstract: We study active structure learning of Bayesian networks in an observational setting, in which there are external limitations on the number of variable values that can be observed from the same sample. Random samples are drawn from the joint distribution of the network variables, and the algorithm iteratively selects which variables to observe in the next sample. We propose a new active learning al… ▽ More We study active structure learning of Bayesian networks in an observational setting, in which there are external limitations on the number of variable values that can be observed from the same sample. Random samples are drawn from the joint distribution of the network variables, and the algorithm iteratively selects which variables to observe in the next sample. We propose a new active learning algorithm for this setting, that finds with a high probability a structure with a score that is $ε$-close to the optimal score. We show that for a class of distributions that we term stable, a sample complexity reduction of up to a factor of $\widetildeΩ(d^3)$ can be obtained, where $d$ is the number of network variables. We further show that in the worst case, the sample complexity of the active algorithm is guaranteed to be almost the same as that of a naive baseline algorithm. To supplement the theoretical results, we report experiments that compare the performance of the new active algorithm to the naive baseline and demonstrate the sample complexity improvements. Code for the algorithm and for the experiments is provided at https://github.com/noabdavid/activeBNSL. △ Less

Submitted 17 July, 2022; v1 submitted 25 March, 2021; originally announced March 2021.

Journal ref: Journal of Machine Learning Research, 23(188):1--38, 2022

arXiv:2010.06288 [pdf, other]

Microsecond Consensus for Microsecond Applications

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis, Igor Zablotchi

Abstract: We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memo… ▽ More We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail-over the system - this cuts the replication and fail-over latencies of the prior systems by at least 61% and 90%. Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design. We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: Full version of OSDI'20 paper

arXiv:2007.02372 [pdf, other]

Constant-Time Snapshots with Applications to Concurrent Data Structures

Authors: Yuanhao Wei, Naama Ben-David, Guy E. Blelloch, Panagiota Fatourou, Eric Ruppert, Yihan Sun

Abstract: We present an approach for efficiently taking snapshots of the state of a collection of CAS objects. Taking a snapshot allows later operations to read the value that each CAS object had at the time the snapshot was taken. Taking a snapshot requires a constant number of steps and returns a handle to the snapshot. Reading a snapshotted value of an individual CAS object using this handle is wait-free… ▽ More We present an approach for efficiently taking snapshots of the state of a collection of CAS objects. Taking a snapshot allows later operations to read the value that each CAS object had at the time the snapshot was taken. Taking a snapshot requires a constant number of steps and returns a handle to the snapshot. Reading a snapshotted value of an individual CAS object using this handle is wait-free, taking time proportional to the number of successful CASes on the object since the snapshot was taken. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on concurrent data structures built from CAS objects. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to two search trees, one balanced and one not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries. △ Less

Submitted 30 December, 2020; v1 submitted 5 July, 2020; originally announced July 2020.

Comments: To appear in PPoPP'21

arXiv:2004.02841 [pdf, other]

NVTraverse: In NVRAM Data Structures, the Destination is More Important than the Journey

Authors: Michal Friedman, Naama Ben-David, Yuanhao Wei, Guy E. Blelloch, Erez Petrank

Abstract: The recent availability of fast, dense, byte-addressable non-volatile memory has led to increasing interest in the problem of designing and specifying durable data structures that can recover from system crashes. However, designing durable concurrent data structures that are efficient and also satisfy a correctness criterion has proven to be very difficult, leading many algorithms to be inefficien… ▽ More The recent availability of fast, dense, byte-addressable non-volatile memory has led to increasing interest in the problem of designing and specifying durable data structures that can recover from system crashes. However, designing durable concurrent data structures that are efficient and also satisfy a correctness criterion has proven to be very difficult, leading many algorithms to be inefficient or incorrect in a concurrent setting. In this paper, we present a general transformation that takes a lock-free data structure from a general class called traversal data structure (that we formally define) and automatically transforms it into an implementation of the data structure for the NVRAM setting that is provably durably linearizable and highly efficient. The transformation hinges on the observation that many data structure operations begin with a traversal phase that does not need to be persisted, and thus we only begin persisting when the traversal reaches its destination. We demonstrate the transformation's efficiency through extensive measurements on a system with Intel's recently released Optane DC persistent memory, showing that it can outperform competitors on many workloads. △ Less

Submitted 24 November, 2021; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:1905.12143 [pdf, other]

The Impact of RDMA on Agreement

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Igor Zablotchi

Abstract: Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the in… ▽ More Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires $n \geq 2f_P + 1$ processes (where $f_P$ is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires $n \geq f_P + 1$ processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions. △ Less

Submitted 25 February, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: Full version of PODC'19 paper, strengthened broadcast algorithm

arXiv:1806.04780 [pdf, other]

Delay-Free Concurrency on Faulty Persistent Memory

Authors: Naama Ben-David, Guy E. Blelloch, Michal Friedman, Yuanhao Wei

Abstract: Non-volatile memory (NVM) promises persistent main memory that remains correct despite loss of power. This has sparked a line of research into algorithms that can recover from a system crash. Since caches are expected to remain volatile, concurrent data structures and algorithms must be redesigned to guarantee that they are left in a consistent state after a system crash, and that the execution ca… ▽ More Non-volatile memory (NVM) promises persistent main memory that remains correct despite loss of power. This has sparked a line of research into algorithms that can recover from a system crash. Since caches are expected to remain volatile, concurrent data structures and algorithms must be redesigned to guarantee that they are left in a consistent state after a system crash, and that the execution can be continued upon recovery. However, the prospect of redesigning every concurrent data structure or algorithm before it can be used in NVM architectures is daunting. In this paper, we present a construction that takes any concurrent program with reads, writes and CASs to shared memory and makes it persistent, i.e., can be continued after one or more processes fault and have to restart. Importantly the converted algorithm has constant computational delay (preserves instruction counts on each process within a constant factor), as well as constant recovery delay (a process can recover from a fault in a constant number of instructions). We show this first for a simple transformation, and then present optimizations to make it more practical, allowing for a tradeoff for better constant factors in computational delay, for sometimes increased recovery delay. We also provide an optimized transformation that works for any normalized lock-free data structure, thus allowing more efficient constructions for a large class of concurrent algorithms. We experimentally evaluate our transformations by applying them to a queue. △ Less

Submitted 18 June, 2020; v1 submitted 12 June, 2018; originally announced June 2018.

arXiv:1803.08617 [pdf, other]

doi 10.1145/3323165.3323185

Multiversion Concurrency with Bounded Delay and Precise Garbage Collection

Authors: Naama Ben-David, Guy E. Blelloch, Yihan Sun, Yuanhao Wei

Abstract: In this paper we are interested in bounding the number of instructions taken to process transactions. The main result is a multiversion transactional system that supports constant delay (extra instructions beyond running in isolation) for all read-only transactions, delay equal to the number of processes for writing transactions that are not concurrent with other writers, and lock-freedom for conc… ▽ More In this paper we are interested in bounding the number of instructions taken to process transactions. The main result is a multiversion transactional system that supports constant delay (extra instructions beyond running in isolation) for all read-only transactions, delay equal to the number of processes for writing transactions that are not concurrent with other writers, and lock-freedom for concurrent writers. The system supports precise garbage collection in that versions are identified for collection as soon as the last transaction releases them. As far as we know these are first results that bound delays for multiple readers and even a single writer. The approach is particularly useful in situations where read-transactions dominate write transactions, or where write transactions come in as streams or batches and can be processed by a single writer (possibly in parallel). The approach is based on using functional data structures to support multiple versions, and an efficient solution to the Version Maintenance (VM) problem for acquiring, updating and releasing versions. Our solution to the VM problem is precise, safe and wait-free (PSWF). We experimentally validate our approach by applying it to balanced tree data structures for maintaining ordered maps. We test the transactional system using multiple algorithms for the VM problem, including our PSWF VM algorithm, and implementations with weaker guarantees based on epochs, hazard pointers, and read-copy-update. To evaluate the functional data structure for concurrency and multi-versioning, we implement batched updates for functional tree structures and compare the performance with state-of-the-art concurrent data structures for balanced trees. The experiments indicate our approach works well in practice over a broad set of criteria. △ Less

Submitted 15 May, 2019; v1 submitted 22 March, 2018; originally announced March 2018.

arXiv:1710.02637 [pdf, other]

Implicit Decomposition for Write-Efficient Connectivity Algorithms

Authors: Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Charles McGuffey, Julian Shun

Abstract: The future of main memory appears to lie in the direction of new technologies that provide strong capacity-to-performance ratios, but have write operations that are much more expensive than reads in terms of latency, bandwidth, and energy. Motivated by this trend, we propose sequential and parallel algorithms to solve graph connectivity problems using significantly fewer writes than conventional a… ▽ More The future of main memory appears to lie in the direction of new technologies that provide strong capacity-to-performance ratios, but have write operations that are much more expensive than reads in terms of latency, bandwidth, and energy. Motivated by this trend, we propose sequential and parallel algorithms to solve graph connectivity problems using significantly fewer writes than conventional algorithms. Our primary algorithmic tool is the construction of an $o(n)$-sized "implicit decomposition" of a bounded-degree graph $G$ on $n$ nodes, which combined with read-only access to $G$ enables fast answers to connectivity and biconnectivity queries on $G$. The construction breaks the linear-write "barrier", resulting in costs that are asymptotically lower than conventional algorithms while adding only a modest cost to querying time. For general non-sparse graphs on $m$ edges, we also provide the first $o(m)$ writes and $O(m)$ operations parallel algorithms for connectivity and biconnectivity. These algorithms provide insight into how applications can efficiently process computations on large graphs in systems with read-write asymmetry. △ Less

Submitted 7 October, 2017; originally announced October 2017.

Showing 1–19 of 19 results for author: Ben-David, N