Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore drive the MTBF of such clusters to unacceptable levels. The software framewo ...
Institute of Electrical and Electronics Engineers Computer Society, Piscataway, NJ 08855-1331, United States, 2005
The perfectly-synchronized round-based model provides the powerful abstraction of crash-stop failures with atomic and synchronous message delivery. This abstraction makes distributed programming very easy. We describe a technique to automatically transform ...
Area- and power-constrained edge devices are increasingly utilized to perform compute-intensive workloads, necessitating increasingly area- and power-efficient accelerators. In this context, in-SRAM computing performs hundreds of parallel operations on spati ...
Sender-based message logging, a low-overhead mechanism for providing transparent fault-tolerance in distributed systems, is described. It differs from conventional message logging mechanisms in that each message is logged in volatile memory on the machine ...
Message logging and checkpointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model, we prove that ...
This paper presents the process-replication protocol of Manetho, a system whose goal is to provide efficient, application-transparent fault tolerance to long-running distributed computations. Manetho uses a new negative acknowledgment multicast protocol t ...