In a distributed computing system, a failure detector is an application or subsystem responsible for detecting node failures or crashes. Failure detectors were introduced by Chandra and Toueg in their 1996 paper Unreliable Failure Detectors for Reliable Distributed Systems. The paper presents the failure detector as a tool for solving consensus (getting all correct processes to agree on a single value) and atomic broadcast (getting all correct processes to deliver the same messages in the same order) in a distributed system where processes may crash. In other words, a failure detector identifies processes that appear to have failed so that the rest of the system can keep operating reliably. In practice, once a failure detector reports a crash, the system typically excludes the suspected process to prevent further errors.

Failure detectors are now widely used in distributed computing systems to detect application errors, such as a software application that stops functioning properly. As distributed computing projects (see List of distributed computing projects) become more popular, failure detectors become correspondingly more important.

Chandra and Toueg approached the problem of detecting failed nodes by introducing the unreliable failure detector. They describe its behavior as follows: each process in the system has access to a local failure detector component, and each local component monitors a subset of the processes in the system. In addition, each local component maintains a list of the processes it currently suspects to have crashed. The detector is unreliable in that it can make mistakes, for example by suspecting a process that is still running. Chandra and Toueg showed that an unreliable failure detector can nevertheless be good enough to solve problems such as consensus. They treat unreliable failure detectors as the general case, because unreliable and reliable failure detectors are described by the same properties.
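The local component described above, which monitors a subset of processes and keeps a list of current suspects, can be illustrated with a minimal sketch. This is not Chandra and Toueg's construction. It assumes a heartbeat-and-timeout scheme, which is one common way to implement such a component, and the names `LocalFailureDetector`, `heartbeat`, `check`, and `timeout` are hypothetical. Timestamps are passed in explicitly so that the example is deterministic.

```python
class LocalFailureDetector:
    """Hypothetical local failure detector component for one process.

    Assumption: monitored processes send periodic heartbeats, and a
    process is suspected when none has arrived within `timeout` seconds.
    """

    def __init__(self, monitored, timeout):
        self.timeout = timeout
        # Time of the most recent heartbeat from each monitored process.
        self.last_heartbeat = {p: 0.0 for p in monitored}
        # Processes this component currently suspects to have crashed.
        self.suspects = set()

    def heartbeat(self, process, now):
        """Record a heartbeat received from `process` at time `now`."""
        if process not in self.last_heartbeat:
            return  # not in the subset this component monitors
        self.last_heartbeat[process] = now
        # The detector is unreliable: an earlier suspicion may have been
        # a mistake, so it is withdrawn when the process is heard from.
        self.suspects.discard(process)

    def check(self, now):
        """Update and return a copy of the suspect list at time `now`."""
        for process, last in self.last_heartbeat.items():
            if now - last > self.timeout:
                self.suspects.add(process)
        return set(self.suspects)


if __name__ == "__main__":
    fd = LocalFailureDetector(monitored=["p1", "p2"], timeout=3.0)
    fd.heartbeat("p1", now=1.0)
    fd.heartbeat("p2", now=1.0)
    fd.heartbeat("p1", now=4.0)
    print(fd.check(now=5.0))   # p2 has been silent for 4.0 s: {'p2'}
    fd.heartbeat("p2", now=6.0)
    print(fd.check(now=6.5))   # suspicion withdrawn: set()
```

In the usage example, `p2` is suspected after a period of silence and then removed from the suspect list when a later heartbeat arrives, which is the kind of mistake an unreliable failure detector is allowed to make. A slow network would produce the same false suspicion as a real crash, so the choice of `timeout` trades detection speed against the rate of wrong suspicions.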