Lecture

Stream Processing and Fault Tolerance

In course

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students

Description

This lecture covers stream processing and fault tolerance in big data analytics. It discusses how time is measured in data streams, techniques for managing streams efficiently, and scale-out platforms such as Spark Streaming and Apache Flink. The instructor explains fault tolerance techniques for stream processing systems, including replication, upstream backup, state partitioning, and immutable tasks, and introduces DStreams for discretized stream processing. Examples of a streaming word count and sliding window operations show how batch and streaming computations can be combined. The lecture concludes with a vision of unifying the batch and stream processing models in a single stack.
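To make the sliding-window word count concrete, here is a minimal sketch in plain Python. It is not the Spark Streaming API and is not taken from the lecture. It assumes the stream has already been cut into micro-batches (lists of text lines) and that the window is measured in batches; the names `count_words`, `sliding_word_counts`, and `window_size` are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, List


def count_words(batch: Iterable[str]) -> Counter:
    """Batch step: count whitespace-separated words in one micro-batch."""
    counts: Counter = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


def sliding_word_counts(
    batches: Iterable[List[str]], window_size: int
) -> Iterator[Counter]:
    """After each micro-batch, yield word counts over the last `window_size` batches.

    Assumption: the window is a number of batches, not a time interval.
    """
    window: Deque[Counter] = deque()   # per-batch counts currently in the window
    running: Counter = Counter()       # aggregate over the whole window
    for batch in batches:
        newest = count_words(batch)
        window.append(newest)
        running.update(newest)                   # add the batch entering the window
        if len(window) > window_size:
            running.subtract(window.popleft())   # remove the batch leaving the window
            running = +running                   # drop words whose count fell to zero
        yield Counter(running)                   # copy, so callers cannot mutate state


if __name__ == "__main__":
    stream = [["a b a"], ["b c"], ["a"], []]
    for i, counts in enumerate(sliding_word_counts(stream, window_size=2)):
        print(i, dict(counts))
```

Each micro-batch is processed by an ordinary batch function, and the window result is maintained incrementally by adding the newest batch and subtracting the oldest one. This is one way to read the combination of batch and streaming computation mentioned in the description.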

Instructor

Anastasia Ailamaki

Related lectures (15)

Introduction to Data Stream Processing

Covers the fundamentals of data stream processing, including tools like Apache Storm and Kafka, key concepts like event time and window operations, and the challenges of stream processing.

Advanced Data Stream Processing Concepts

Explores event time vs. processing time, stream processing operations, stream-stream joins, and handling late/out-of-order data in data stream processing.

Data Stream Processing: Apache Kafka and Spark

Covers data stream processing with Apache Kafka and Spark, including event time vs processing time, stream processing operations, and stream-stream joins.

Introduction to Data Stream Processing: Concepts and Applications

Covers data stream processing concepts, focusing on Apache Kafka and Spark Streaming integration, event time management, and project implementation guidelines.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.