Fault tolerance

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation. A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails. The term is most commonly used to describe computer systems designed to continue more or less fully operational with, perhaps, a reduction in throughput or an increase in response time in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured, or a structure that is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact. Within the scope of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.

uBFT: Microsecond-Scale BFT using Disaggregated Memory

Rachid Guerraoui, Antoine Murat, Mihail Igor Zablotchi, Athanasios Xygkis, Naama Ben David

We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only 2f+1 replicas to tolerate f Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pu ...

2023

Byzantine Fault-Tolerance in Federated Local SGD Under 2f-Redundancy

Nirupam Gupta

In this article, we study the problem of Byzantine fault-tolerance in a federated optimization setting, where there is a group of agents communicating with a centralized coordinator. We allow up to

f

Byzantine-faulty agents, which may not follow a prescr ...

Ieee-Inst Electrical Electronics Engineers Inc2023

uBFT: Microsecond-Scale BFT using Disaggregated Memory

Rachid Guerraoui, Antoine Murat, Mihail Igor Zablotchi, Athanasios Xygkis, Naama Ben David

2023

Byzantine Fault-Tolerance in Federated Local SGD Under 2f-Redundancy

Nirupam Gupta

In this article, we study the problem of Byzantine fault-tolerance in a federated optimization setting, where there is a group of agents communicating with a centralized coordinator. We allow up to

f

Byzantine-faulty agents, which may not follow a prescr ...

Ieee-Inst Electrical Electronics Engineers Inc2023

Graph Chatbot

Planetary-Scale Byzantine Fault Tolerance

uBFT: Microsecond-Scale BFT using Disaggregated Memory

Byzantine Fault-Tolerance in Federated Local SGD Under 2f-Redundancy

Planetary-Scale Byzantine Fault Tolerance

uBFT: Microsecond-Scale BFT using Disaggregated Memory

Byzantine Fault-Tolerance in Federated Local SGD Under 2f-Redundancy