This lecture discusses the challenge of minimizing job completion time in distributed computing, focusing on data skew and its impact on performance. It explores how skewed data distribution affects reducers, the limitations of standard approaches, and the optimization goals for improving efficiency. The presentation covers execution models such as MapReduce and Spark, emphasizing the importance of parallelism and efficient processing. Several algorithms for theta-joins are examined, including the 1-Bucket-Theta algorithm, which illustrates how randomization helps reduce output skew. The lecture concludes with the challenges that remain in achieving optimal join computation over distributed data.
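The randomization idea behind 1-Bucket-Theta can be illustrated with a minimal single-process sketch. It is a simplified variant written for this summary, not the lecture's own code: it assumes the reducers form a `rows` x `cols` grid, and the names `one_bucket_theta`, `rows`, `cols`, and `seed` are hypothetical. Each tuple of S picks a random grid row and is replicated to every reducer in that row; each tuple of T picks a random grid column and is replicated to every reducer in that column. Every (s, t) pair therefore meets at exactly one reducer, regardless of the join predicate or of how the values are distributed.

```python
import random
from collections import defaultdict


def one_bucket_theta(S, T, theta, rows, cols, seed=0):
    """Simulate a randomized theta-join on a rows x cols grid of reducers.

    Hypothetical helper for illustration; a real job would run the map and
    reduce phases on separate workers (e.g. in MapReduce or Spark).
    """
    rng = random.Random(seed)
    # reducers[(i, j)] holds the S tuples and T tuples sent to that reducer.
    reducers = defaultdict(lambda: ([], []))

    # Map phase: the assignment ignores tuple values, so a skewed join
    # attribute cannot concentrate work on a single reducer.
    for s in S:
        i = rng.randrange(rows)          # random grid row for this S tuple
        for j in range(cols):            # replicate across the whole row
            reducers[(i, j)][0].append(s)
    for t in T:
        j = rng.randrange(cols)          # random grid column for this T tuple
        for i in range(rows):            # replicate down the whole column
            reducers[(i, j)][1].append(t)

    # Reduce phase: each reducer evaluates the predicate on its local
    # cross product. A pair (s, t) is seen only at reducer (row(s), col(t)).
    output = []
    for s_part, t_part in reducers.values():
        output.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return output, reducers


if __name__ == "__main__":
    S = [1, 5, 5, 9]
    T = [2, 5, 7]
    pairs, reducers = one_bucket_theta(S, T, lambda s, t: s < t, rows=2, cols=3)
    print(sorted(pairs))
    for key in sorted(reducers):
        s_part, t_part = reducers[key]
        print(key, len(s_part), len(t_part))
```

The cost of this scheme is input replication: each S tuple is copied `cols` times and each T tuple `rows` times, so the grid shape trades communication volume against per-reducer load.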