Lecture

Fault Tolerance and Recovery: Data Safety in Distributed Computing

Related lectures (23)

Explores the Spark ecosystem, Resilient Distributed Datasets, and the storage layer abstraction in Spark.

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Hadoop: Execution Models

Explores Hadoop's execution models, fault tolerance, data locality, and scheduling, highlighting the limitations of MapReduce and alternative distributed processing frameworks.

Big Data Challenges: Distributed Computing with Spark

Explores big data challenges, distributed computing with Spark, RDDs, hardware requirements, MapReduce, transformations, and Spark DataFrames.

Big Data Ecosystems: Technologies and Challenges

Covers the fundamentals of big data ecosystems, focusing on technologies, challenges, and practical exercises with Hadoop's HDFS.

Big Data: Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, typical architecture, challenges, and technologies used to address them.

Scheduling Decisions: Data Locality and Multitenancy

Explores data locality in scheduling decisions for multi-tenant platforms and discusses Hadoop's architecture, execution engine optimizations, and fault tolerance strategies.

Scaling up: Spark and Big Data

Explores the challenges of big data processing and introduces Spark as a solution.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Integrating Scalable Data Storage and Map Reduce Processing with Hadoop

Covers the integration of scalable data storage and MapReduce processing using Hadoop, including HDFS, Hive, Parquet, ORC, Spark, and HBase.

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.

General-Purpose Distributed Execution System

Explores the design of a general-purpose distributed execution system, covering challenges, specialized frameworks, decentralized control logic, and high-performance shuffle.

Introduction to Spark Runtime Architecture

Covers the Spark runtime architecture, including RDDs, transformations, actions, and caching for performance optimization.

Data Wrangling with Hadoop

Covers data wrangling techniques using Hadoop, focusing on row-oriented versus column-oriented databases, popular storage formats, and HBase-Hive integration.

Big Data Challenges: Scaling to Massive Data

Explores the challenges of handling massive data, discussing solutions such as MapReduce and Spark.

Hadoop Ecosystem: Architectural Choices & MapReduce Programming

Explores the Hadoop ecosystem's architecture and MapReduce programming model, emphasizing strengths and limitations.

Execution Models for Distributed Computing - 2nd generation

Explores the second generation of execution models for distributed computing, focusing on Spark and Resilient Distributed Datasets (RDDs).

Introduction to Spark runtime architecture

Introduces Apache Spark, covering its key features, history, RDDs, architecture, and distributed computing framework.

Data formats and data wrangling with Hadoop

Explores Apache Hive for data warehousing, data formats, and partitioning, with practical exercises in querying and connecting to Hive.

Data Wrangling with Hadoop: Storage Formats and Hive

Explores data wrangling with Hadoop, emphasizing storage formats and Hive for big data processing.