Publication# The Heard-Of Model: Computing in Distributed Systems with Benign Failures

Résumé

Problems in fault-tolerant distributed computing have been studied in a variety of models. These models are structured around two central ideas: Degree of synchrony and failure model are two independent parameters that determine a particular type of system. The notion of faulty component is helpful and even necessary for the analysis of distributed computations when failures occur. In this work, we question these two basic principles of fault-tolerant distributed computing, and show that it is both possible and worthy to renounce them in the context of benign failures: we present a computational model, suitable for systems with benign failures, which is based only on the notion of transmission failure. In this model, computations evolve in rounds, and messages missed at a round are lost. Only information transmission is represented: for each round r and each process p, our model provides the set of processes that "hears of'' at round r (heard-of set) namely the processes from which p receives some message at round r. The features of a specific system are thus captured as a whole, just by a predicate over the collection of heard-of sets. We show that our model handles benign failures, be they static or dynamic, permanent or transient, in a unified framework. Using this new approach, we are able to give shorter and simpler proofs of important results (non-solvability, lower bounds). In particular, we prove that in general, Consensus cannot be solved without an implicit and permanent consensus on heard-of sets. We also examine Consensus algorithms in our model. In light of this specific agreement problem, we show how our approach allows us to devise new interesting solutions.

Official source

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Concepts associés

Chargement

Publications associées

Chargement

Publications associées (4)

Chargement

Chargement

Chargement

Concepts associés (14)

Tolérance aux pannes

vignette|Fichier GIF animé de 8 algorithmes ECT dans un réseau 802.1aq. La source est surlignée en violet, la destination en jaune. Les lignes violettes sont des chemins entre la source et la destinat

Système

Un système est un ensemble d' interagissant entre eux selon certains principes ou règles. Par exemple une molécule, le système solaire, une ruche, une société humaine, un parti, une armée etc.
Un s

Message

Un message est un ensemble de signes. Il implique donc un codage par l'émetteur, et un décodage par le récepteur (d'où la nécessité d'un code commun). La théorie de l'information fut mise au point po

Total order multicast algorithms constitute an important class of problems in distributed systems, especially in the context of fault-tolerance. In short, the problem of total order multicast consists in sending messages to a set of processes, in such a way that all messages are delivered by all correct destinations in the same order. However, the huge amount of literature on the subject and the plethora of solutions proposed so far make it difficult for practitioners to select a solution adapted to their specific problem. As a result, naive solutions are often used while better solutions are ignored. This paper proposes a classification of total order multicast algorithms based on the ordering mechanism of the algorithms, and describes a set of common characteristics (e.g., assumptions, properties) with which to evaluate them. In this classification, more than fifty total order broadcast and multicast algorithms are surveyed. The presentation includes asynchronous algorithms as well as algorithms based on the more restrictive synchronous model. Fault-tolerance issues are also considered as the paper studies the properties and behavior of the different algorithms with respect to failures.

2000Total order multicast algorithms constitute an important class of problems in distributed systems, especially in the context of fault-tolerance. In short, the problem of total order multicast consists in sending messages to a set of processes, in such a way that all messages are delivered by all correct destinations in the same order. However, the huge amount of literature on the subject and the plethora of solutions proposed so far make it difficult for practitioners to select a solution adapted to their specific problem. As a result, naive olutions are often used while better solutions are ignored. This paper proposes a classification of total order multicast algorithms based on the ordering mechanism of the algorithms, and describes a set of common characteristics (e.g., assumptions, properties) with which to evaluate them. In this classification, more than fifty total order broadcast and multicast algorithms are surveyed. The presentation includes asynchronous algorithms as well as algorithms based on the more restrictive synchronous model. Fault-tolerance issues are also considered as the paper studies the properties and behavior of the different algorithms with respect to failures. KEYWORDS: Distributed systems, agreement problems, message passing, total ordering, global ordering, atomic multicast, atomic broadcast, classification, taxonomy

2000Message logging and check pointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model_ we prove that the set of recoverable system states that have occurred during any single execution of the system forms a lattice, and that therefore, there is always a unique maximum recoverable system state, which never decreases. Based on this model, we present an algorithm for determining this maximum recoverable state, and prove its correctness. Our algorithm utilizes all logged messages and checkpoints, and thus always finds the maximum recoverable state possible. Previous recovery methods using optimistic message logging and checkpointing have not considered the existing checkpoints, and thus may not find this maximum state. Furthermore, by utilizing the checkpoints, some messages received by a process before it was checkpointed may not need to be logged. Using our algorithm also adds less communication overhead to the system than do previous methods. Our model and algorithm can be used with any message logging protocol, whether pessimistic or optimistic, but their full generality is only required with optimistic logging protocols.

1990