Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.
Explores the design of a general-purpose distributed execution system, covering challenges, specialized frameworks, decentralized control logic, and high-performance shuffle.