Lecture

Data formats and data wrangling with Hadoop

In course

COM-490: Large-scale data science for real-world data

This hands-on course teaches the tools & methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes the students to the entire data science pipe

Description

This lecture covers the use of Apache Hive as a data warehouse for managing large datasets, including data formats, data wrangling, and partitioning. Topics include HiveQL, schema on read, data storage, partitioning, and Hive under the hood. The lecture also explores popular HDFS storage formats, Hive equivalents in cloud solutions, and the differences between HDFS and Hive. Practical exercises involve creating external tables, querying data, and connecting to Hive using PyHive.
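As a rough illustration of the practical exercises mentioned above, the sketch below builds a HiveQL statement for a partitioned external table and runs a query through PyHive. It is not taken from the course material: the helper names (`external_table_ddl`, `create_and_query`), the table name, columns, HDFS path, and the host and port defaults are all hypothetical and would have to match the actual cluster.

```python
def external_table_ddl(table, columns, location, partition_by=None, stored_as="PARQUET"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement (hypothetical helper).

    columns and partition_by are lists of (name, hive_type) pairs.
    The table is external, so Hive only records the schema: the files
    under `location` stay where they are and are parsed at read time.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_by:
        parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_by)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS {stored_as}\nLOCATION '{location}'"
    return ddl


def create_and_query(ddl, query, host="localhost", port=10000, username=None):
    """Run a DDL statement, then a query, against HiveServer2 via PyHive.

    Assumes the pyhive package is installed and a server is reachable at
    host:port (both values here are placeholders).
    """
    from pyhive import hive  # imported lazily so the DDL helper works without PyHive

    conn = hive.connect(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(ddl)    # DDL produces no result set
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical table: trip records stored as Parquet, partitioned by year.
ddl = external_table_ddl(
    "trips",
    [("id", "BIGINT"), ("fare", "DOUBLE")],
    "/data/trips",
    partition_by=[("year", "INT")],
)
print(ddl)
```

With a reachable server, `create_and_query(ddl, "SELECT COUNT(*) FROM trips WHERE year = 2020")` would return the matching rows as a list of tuples. For a partitioned external table, existing partition directories usually also have to be registered (for example with `MSCK REPAIR TABLE trips`) before queries see their data.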

Instructors (3)

Olivier Verscheure

Sofiane Sarni

Pamela Isabel Delgado Borda

I am a PhD student in the School of Computer and Communication Sciences at EPFL. I am part of the Operating Systems Laboratory, and my advisor is Prof. Willy Zwaenepoel. I received my Bachelor's degree in Systems Engineering from Universidad Catolica Boliviana, Bolivia, in 2008 and my Master's degree in Computer Science, with a specialization in Foundations of Software, from EPFL in 2012.

Ontological neighbourhood

Computer engineering

Databases: Relational databases

Related lectures (32)

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Data Wrangling with Hadoop

Covers data wrangling techniques using Hadoop, focusing on row versus column-oriented databases, popular storage formats, and HBase-Hive integration.

Data Wrangling with Hadoop: Storage Formats and Hive

Explores data wrangling with Hadoop, emphasizing storage formats and Hive for big data processing.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.