Distributed Computing Execution Models

In course

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students

Description

This lecture discusses the challenge of minimizing job completion time in distributed computing, focusing on data skew and its impact on performance. It examines how skewed data distributions affect reducers, where standard approaches fall short, and which optimization goals improve efficiency. The presentation covers execution models such as MapReduce and Spark, emphasizing parallelism and efficient processing. Several algorithms for theta-joins are examined, including 1-Bucket-Theta, which illustrates how randomization reduces output skew. The lecture concludes with the challenges that remain in computing joins optimally over distributed data.
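As an illustration of the randomization idea behind 1-Bucket-Theta, the following is a minimal single-process sketch in Python, not the lecture's own material. It assumes the reducers form a simple rows x cols grid over the join matrix; the function name `one_bucket_theta_join`, its parameters, and the in-memory simulation of the map and reduce phases are all hypothetical choices made for this example. Each tuple of S is sent to one randomly chosen row of the grid and each tuple of T to one randomly chosen column, so every (s, t) pair meets at exactly one reducer regardless of how the join-key values are distributed.

```python
import random
from collections import defaultdict


def one_bucket_theta_join(S, T, theta, rows, cols, seed=0):
    """Simulate a randomized grid-partitioned theta-join on one machine.

    Assumption for this sketch: reducers are identified by (row, col)
    cells of a rows x cols grid covering the S x T join matrix.
    """
    rng = random.Random(seed)
    # reducer id -> (list of S tuples, list of T tuples)
    reducers = defaultdict(lambda: ([], []))

    # "Map" phase: an S tuple picks a random row and is replicated
    # to every reducer in that row.
    for s in S:
        i = rng.randrange(rows)
        for j in range(cols):
            reducers[(i, j)][0].append(s)

    # A T tuple picks a random column and is replicated to every
    # reducer in that column.
    for t in T:
        j = rng.randrange(cols)
        for i in range(rows):
            reducers[(i, j)][1].append(t)

    # "Reduce" phase: each reducer joins only its local tuples.
    output = []
    for s_part, t_part in reducers.values():
        output.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return output


if __name__ == "__main__":
    S = [1, 5, 5, 9, 3]
    T = [2, 5, 7]
    result = one_bucket_theta_join(S, T, lambda s, t: s < t, rows=2, cols=2)
    print(sorted(result))
```

Because the row and column are chosen at random rather than by join-key value, a heavily repeated key does not pile up on a single reducer. The cost is replication: each S tuple is copied `cols` times and each T tuple `rows` times.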


Instructor

Anastasia Ailamaki
