Publication

Abstractions for asynchronous distributed computing with malicious players

Marko Vukolic
2008
EPFL thesis
Abstract

In modern distributed systems, failures are the norm rather than the exception. In many cases, these failures are not benign. Settings such as the Internet might incur malicious (also called Byzantine or arbitrary) behavior and asynchrony. As a result, and perhaps not surprisingly, research on asynchronous Byzantine fault-tolerant (BFT) distributed systems is flourishing. Tolerating arbitrary behavior and asynchrony calls for very sophisticated algorithms. This is in particular the case with BFT solutions that aim to provide properties such as: (a) optimal resilience, i.e., tolerating as many Byzantine failures as possible and (b) optimal performance with respect to some relevant complexity metric. Most BFT algorithms are built from scratch or by modifying existing solutions in a non-modular manner, which often renders these algorithms difficult to understand and, consequently, impedes their wider adoption. We attribute this complexity to the lack of sufficient number of adequate abstractions for asynchronous BFT distributed computing. The motivation of this thesis is to propose reusable abstractions for devising asynchronous BFT distributed algorithms that are optimally resilient and/or have optimal complexity, with strong focus on one of the most important complexity metrics — time complexity (or latency). The abstractions proposed in this thesis are devised with three fundamental distributed applications in mind: (a) read/write storage (also called register), (b) consensus and (c) state machine replication (SMR). We demonstrate how to use our abstractions in these applications to devise asynchronous BFT algorithms that feature the best complexity among all algorithms we know of, in addition to optimal resilience. First, we introduce the notion of a refined quorum system (RQS) of some set S as a set of three classes of subsets (quorums) of S: first class quorums are also second class quorums, themselves being also third class quorums. First class quorums have large intersections with all other quorums, second class quorums typically have smaller intersections with those of the third class, the latter simply correspond to traditional quorums. The refined quorum system abstraction helps design algorithms that tolerate contention (process concurrency), arbitrarily long periods of asynchrony and the largest possible number of failures, but perform fast if few failures occur, the system is synchronous and there is no contention, i.e., under conditions that are assumed to be frequent in practice. In other words, RQS helps combine optimal resilience and optimal best-case time complexity. Intuitively, under uncontended and synchronous conditions, a distributed object implementation would expedite an operation if a quorum of the first class is accessed, then degrade gracefully depending on whether a quorum of the second or the third class is accessed. Our notion of RQS is devised assuming a general adversary structure, and this basically allows algorithms relying on RQS to relax the assumption of independent process failures. We illustrate the power of refined quorums by introducing two new optimal BFT atomic object implementations: an atomic storage and consensus algorithm. Our second abstraction is a novel timestamping mechanism called high resolution timestamps (HRts), which can be seen as a variation of a matrix clocks. Roughly speaking, a high resolution timestamp contains a matrix of local timestamps of (a subset of) processes as seen by (a subset of) other processes. Complementary to RQS, HRts simplify the design of BFT distributed algorithms that combine optimal resilience and worst-case time complexity. We apply high-resolution timestamps to design read/write storage algorithms in which HRts are used to detect and filter out Byzantine processes, which paves the path to the first BFT storage algorithms that combine optimal resilience with optimal worst-case time complexity. Finally, we introduce ABsTRACT (Abortable Byzantine faulT-toleRant stAte maChine replicaTion), a generic abstraction that simplifies the notoriously difficult task of developing BFT state machine replication algorithms. ABsTRACT resembles BFT-SMR and it can be used to make any shared service Byzantine fault-tolerant, with one exception: it may sometimes abort a client request. The non-triviality condition under which ABsTRACT cannot abort is a generic parameter. We view a BFT-SMR algorithm as a composition of instances of ABsTRACT, each instance developed and analyzed independently. To illustrate our approach, we describe two new optimally resilient BFT algorithms. The first, that makes use of our refined quorums, has the lowest time complexity among all BFT-SMR algorithms we know of, in synchronous periods that are free from contention and failures. The second algorithm has the highest peak throughput in failure-free and synchronous periods; this algorithm argues for general applicability of ABsTRACT in developing BFT shared services that feature optimal complexity, beyond the time complexity metric.

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related concepts (46)
Byzantine fault
A Byzantine fault (also Byzantine generals problem, interactive consistency, source congruency, error avalanche, Byzantine agreement problem, and Byzantine failure) is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine generals problem", developed to describe a situation in which, to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.
Computational complexity
In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time (generally measured by the number of needed elementary operations) and memory storage requirements. The complexity of a problem is the complexity of the best algorithms that allow solving the problem. The study of the complexity of explicitly given algorithms is called analysis of algorithms, while the study of the complexity of problems is called computational complexity theory.
Consensus (computer science)
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts.
Show more
Related publications (188)

Planetary-Scale Byzantine Fault Tolerance

Matteo Monti

The scale and pervasiveness of the Internet make it a pillar of planetary communication, industry and economy, as well as a fundamental medium for public discourse and democratic engagement. In stark contrast with the Internet's decentralized infrastructur ...
EPFL2024

Byzantine consensus is Θ(n^2): the Dolev-Reischuk bound is tight even in partial synchrony!

Rachid Guerraoui, Jovan Komatovic, Pierre Philippe Civit, Manuel José Ribeiro Vidigueira, Vincent Gramoli, Seth Gilbert

The Dolev-Reischuk bound says that any deterministic Byzantine consensus protocol has (at least) quadratic (in the number of processes) communication complexity in the worst case: given a system with n processes and at most f < n/3 failures, any solution t ...
2023

Byzantine-Resilient Learning Beyond Gradients: Distributing Evolutionary Search

Rachid Guerraoui, Andrei Kucharavy, Matteo Monti

Modern machine learning (ML) models are capable of impressive performances. However, their prowess is not due only to the improvements in their architecture and training algorithms but also to a drastic increase in computational power used to train them.|S ...
New York2023
Show more

Graph Chatbot

Chat with Graph Search

Ask any question about EPFL courses, lectures, exercises, research, news, etc. or try the example questions below.

DISCLAIMER: The Graph Chatbot is not programmed to provide explicit or categorical answers to your questions. Rather, it transforms your questions into API requests that are distributed across the various IT services officially administered by EPFL. Its purpose is solely to collect and recommend relevant references to content that you can explore to help you answer your questions.