Lecture

Fault Tolerance and Recovery: Data Safety in Distributed Computing

Description

This lecture covers fault tolerance in distributed computing systems, focusing on data safety and job recovery. Topics include replication for data safety, the HDFS architecture, job recovery in MapReduce and Spark, and the importance of lineage information for recovery. The instructor emphasizes minimizing the effort needed to recover failed jobs and masking failures so that users do not experience delays.
