This lecture covers advanced Spark topics: partitioning strategies, shuffle operations, and memory management. It examines the internals of the Spark architecture and the cost of shuffle operations. The instructor explains how to optimize Spark jobs by tuning partitions, avoiding unnecessary shuffles, and minimizing memory usage. The lecture also covers Spark parallelization, RDDs, DataFrames, and PySpark internals. Practical exercises and demos illustrate the concepts.
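
To make the point about shuffle cost concrete, here is a minimal sketch in plain Python, not Spark itself and not necessarily the example used in the lecture. It simulates a per-key sum over two input partitions and counts how many records would cross the shuffle boundary with and without a local (map-side) combine, which is the idea behind preferring `reduceByKey` over `groupByKey` in Spark. The function names, the toy partitioner, and the sample data are all hypothetical.

```python
from collections import defaultdict


def hash_partition(key, num_partitions):
    # Toy deterministic partitioner (assumption for illustration only;
    # Spark's own hash partitioner works differently).
    return sum(key.encode("utf-8")) % num_partitions


def sum_after_shuffle(shuffled):
    # Reduce side: sum the values for each key in every target partition.
    totals = defaultdict(int)
    for bucket in shuffled:
        for key, value in bucket:
            totals[key] += value
    return dict(totals)


def shuffle_without_combine(partitions, num_partitions):
    # Every input record is sent across the shuffle unchanged.
    shuffled = [[] for _ in range(num_partitions)]
    records_moved = 0
    for part in partitions:
        for key, value in part:
            shuffled[hash_partition(key, num_partitions)].append((key, value))
            records_moved += 1
    return sum_after_shuffle(shuffled), records_moved


def shuffle_with_combine(partitions, num_partitions):
    # Each input partition first sums its own values per key, so at most
    # one record per key per input partition crosses the shuffle.
    shuffled = [[] for _ in range(num_partitions)]
    records_moved = 0
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        for key, value in local.items():
            shuffled[hash_partition(key, num_partitions)].append((key, value))
            records_moved += 1
    return sum_after_shuffle(shuffled), records_moved


# Hypothetical sample data: two input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

totals_plain, moved_plain = shuffle_without_combine(partitions, 2)
totals_combined, moved_combined = shuffle_with_combine(partitions, 2)
```

Both strategies produce the same totals, but the combined version moves fewer records (4 instead of 6 for this sample). On real data with many repeated keys per partition, the gap can be much larger.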