This lecture covers the challenges posed by big data: the growth of data sources and the limits of single-machine processing. It introduces Spark's resilient distributed datasets (RDDs), which are partitioned across a cluster and processed in parallel. The instructor discusses the hardware used for big data, emphasizing clusters of budget (commodity) hardware and the failures and network latency that come with them. The lecture also explores the MapReduce paradigm, explaining how work is divided across machines and how failures are handled. It then covers the basics of RDD transformations and actions, along with the importance of lazy execution and RDD persistence, and highlights broadcast variables, accumulators, and Spark DataFrames.
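To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python, not taken from the lecture and not Spark's API. It assumes a word-count task and represents each "machine" as one list of lines; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample data are hypothetical, chosen only for illustration. Failure handling and real distribution are left out.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions):
    """Group values by key across all partitions.

    On a real cluster this step moves data over the network.
    """
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the grouped values of each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # Each partition is mapped independently; on a cluster, different
    # machines would process different partitions in parallel.
    mapped = [map_phase(partition) for partition in partitions]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two partitions standing in for two machines.
partitions = [
    ["big data needs many machines"],
    ["many machines fail", "big clusters fail often"],
]
counts = word_count(partitions)
print(counts)
```

Because each partition is mapped independently, a failed map task could in principle be rerun on its partition alone without redoing the others, which is the property that makes the paradigm workable on unreliable hardware.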