Lecture

Advanced Spark Optimizations and Partitioning

In course

COM-490: Large-scale data science for real-world data

This hands-on course teaches the tools & methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes the students to the entire data science pipe

Description

This lecture covers advanced Spark optimizations and partitioning techniques, including handling data skew and imbalance and choosing persistence levels. It also presents an optimization checklist and best practices. Finally, it introduces Spark MLlib for machine learning tasks such as classification, logistic regression, and clustering, and provides references for further learning.

Instructors (3)

Olivier Verscheure

Sofiane Sarni

Pamela Isabel Delgado Borda

I am a PhD student in the School of Computer and Communication Sciences at EPFL. I am part of the Operating Systems Laboratory, and my advisor is Prof. Willy Zwaenepoel. I received my Bachelor's degree in Systems Engineering from Universidad Catolica Boliviana, Bolivia, in 2008 and my Master's degree in Computer Science, with a specialization in Foundations of Software, from EPFL in 2012.

Related lectures (31)

Efficient Machine Learning via Data Summarization

Explores efficient machine learning through data summarization, covering challenges, methods, and impactful applications in various domains.

Logistic Regression: Fundamentals and Applications

Explores logistic regression fundamentals, including cost functions, regularization, and classification boundaries, with practical examples using scikit-learn.

Statistical Analysis of Networks: Link Prediction and Biclustering

Explores link prediction, logistic regression, causal inference, and biclustering in statistical network analysis.

Supervised Learning Overview

Covers CNNs, RNNs, SVMs, and other supervised learning methods, emphasizing the importance of tuning regularization and making informed decisions in machine learning.

Introduction to Machine Learning

Covers the basics of machine learning for physicists and chemists, focusing on image classification and dataset labeling.