This lecture covers fault tolerance in distributed computing systems, focusing on two concerns: data safety and job recovery. Topics include replication for data safety, the HDFS architecture, job recovery in MapReduce and Spark, and the importance of lineage information. The instructor emphasizes minimizing the effort needed to recover failed jobs and masking failures so that users do not experience delays.
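The summary does not say how lineage is used, so the following is only a minimal sketch of the general idea, not Spark's actual API. It assumes each dataset records its parent and the transformation that produced it, so a lost partition can be recomputed on its own instead of rerunning the whole job. The `Dataset` class, its methods, and the in-memory `source` dictionary (a stand-in for replicated input storage) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Dataset:
    """Toy partitioned dataset that remembers how it was derived (its lineage)."""
    num_partitions: int
    parent: Optional["Dataset"] = None
    transform: Optional[Callable[[List[int]], List[int]]] = None
    # Hypothetical stand-in for durable, replicated input storage.
    source: Optional[Dict[int, List[int]]] = None
    # Computed partitions held in memory; entries can be lost on failure.
    cache: Dict[int, List[int]] = field(default_factory=dict)

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        # Record the lineage (parent + transformation) without computing anything.
        return Dataset(self.num_partitions, parent=self,
                       transform=lambda part: [fn(x) for x in part])

    def compute(self, i: int) -> List[int]:
        # Return the cached partition if present; otherwise rebuild it
        # by walking the lineage back towards the source.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            part = list(self.source[i])
        else:
            part = self.transform(self.parent.compute(i))
        self.cache[i] = part
        return part

    def lose_partition(self, i: int) -> None:
        # Simulate a worker failure that drops one computed partition.
        self.cache.pop(i, None)


base = Dataset(2, source={0: [1, 2], 1: [3, 4]})
squared = base.map(lambda x: x * x)
result = squared.map(lambda x: x + 1)

before = [result.compute(i) for i in range(2)]

# Simulate losing partition 1 of both derived datasets.
result.lose_partition(1)
squared.lose_partition(1)

# Only partition 1 is recomputed; partition 0 is served from the cache.
after = [result.compute(i) for i in range(2)]
```

In this sketch, recovery touches only the lost partition and its ancestors, which is one way to keep recovery effort small. The user sees the same result as before the failure.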