Lecture

Introduction to Spark Runtime Architecture

In course

COM-490: Large-scale data science for real-world data

This hands-on course teaches the tools & methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes the students to the entire data science pipe

Description

This lecture provides an overview of Apache Spark, a unified analytics engine for large-scale data processing, covering its architecture, history, key features, and flexibility. It explains the core concepts of the Spark runtime: resilient distributed datasets (RDDs), transformations, actions, and lineage. The lecture also covers Spark's distributed computing framework, the RDD as its basic data abstraction, and the importance of fault tolerance. It then surveys Spark's deployment options, supported languages, data storage, and specialized libraries. Practical exercises using Sparkmagic in Jupyter notebooks are highlighted, along with references for further exploration.
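
The relationship between transformations, actions, and lineage named in the description can be illustrated with a small sketch. This is a toy, single-machine model in plain Python, not Spark's API: the class `ToyRDD`, its methods, and the helper `square` are hypothetical names introduced only for illustration. It assumes the usual reading of these terms: transformations are lazy and only record how a dataset is derived, while actions trigger the computation by replaying that recorded lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's actual class)."""

    def __init__(self, source=None, parent=None, op="source", fn=None):
        self._source = source  # input data, set only on the root dataset
        self._parent = parent  # the dataset this one was derived from
        self._op = op          # name of the step that produced this dataset
        self._fn = fn          # function applied to the parent's elements

    # Transformations: lazy, they only record a new lineage step.
    def map(self, f):
        return ToyRDD(parent=self, op="map",
                      fn=lambda items: (f(x) for x in items))

    def filter(self, pred):
        return ToyRDD(parent=self, op="filter",
                      fn=lambda items: (x for x in items if pred(x)))

    def lineage(self):
        """Return the recorded chain of steps, from the source to this dataset."""
        steps, node = [], self
        while node is not None:
            steps.append(node._op)
            node = node._parent
        return list(reversed(steps))

    def _compute(self):
        # Replay the lineage from the source; nothing is stored in between.
        if self._parent is None:
            return iter(self._source)
        return self._fn(self._parent._compute())

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every element that actually gets processed


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(source=range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # no element has been processed yet
result = even_squares.collect()   # the action runs the recorded steps
calls_after_action = len(calls)   # all five elements have been processed
steps = even_squares.lineage()    # the recorded derivation of the dataset
```

In this sketch every action recomputes its result from the source by replaying the lineage. The same recorded derivation is what would allow lost data to be rebuilt after a failure, and avoiding repeated recomputation is the motivation for caching a dataset that several actions reuse.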

Instructors (3)

Sofiane Sarni

Pamela Isabel Delgado Borda

I am a PhD student in the School of Computer and Communication Sciences at EPFL. I am part of the Operating Systems Laboratory, and my advisor is Prof. Willy Zwaenepoel. I received my Bachelor's degree in Systems Engineering from Universidad Catolica Boliviana, Bolivia, in 2008 and my Master's degree in Computer Science, with a specialization in Foundations of Software, from EPFL in 2012.

Olivier Verscheure

Related lectures (32)

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Data Wrangling Techniques: HBase and Hive Integration

Covers data wrangling techniques using HBase and Hive, focusing on integration and practical applications.

Introduction to Spark Runtime Architecture

Covers the Spark runtime architecture, including RDDs, transformations, actions, and caching for performance optimization.

Digital Transformation: Solutions and Data

Explores digital transformation opportunities, big data, analytics, and technology innovations in business and research.