Lecture

Data Wrangling with Hadoop: Advanced Techniques

Description

This lecture covers advanced data wrangling techniques on Hadoop, combining scalable storage and processing with tools such as Hive and HBase. The instructor explains how columnar file formats such as Parquet and ORC make data processing more efficient. The lecture also covers querying data with HiveQL and writing user-defined functions (UDFs) to handle geospatial and JSON data. Practical exercises guide students through creating and managing Hive tables, loading data, and running complex queries. The session follows the Extract, Transform, Load (ETL) process: connecting to Hive, creating databases, and optimizing data storage. It also shows how partitioning data in Hive improves query performance. By the end of the session, students should understand how to use Hadoop's capabilities for data wrangling in large-scale data environments.
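
The point about partitioning can be illustrated without a cluster. The following is a minimal sketch in plain Python, not material from the lecture: it mimics the Hive-style directory layout (one `column=value` directory per partition, which in HiveQL is declared with a `PARTITIONED BY` clause) and shows why a filter on the partition column only needs to read one directory. The sample rows, column names, and function names are hypothetical, and CSV stands in for Parquet or ORC to keep the example dependency-free.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample records; the column names are illustrative only.
ROWS = [
    {"trip_id": "1", "city": "Lausanne", "year": "2022"},
    {"trip_id": "2", "city": "Geneva", "year": "2022"},
    {"trip_id": "3", "city": "Lausanne", "year": "2023"},
]


def write_partitioned(rows, root, partition_col):
    """Write rows into Hive-style directories such as root/year=2022/part-0.csv."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_col], []).append(row)
    for value, group in by_value.items():
        part_dir = Path(root) / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is encoded in the directory name, not in the file.
        fields = [name for name in group[0] if name != partition_col]
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, partition_col, value):
    """Read only the directory matching the filter (a simple form of partition pruning)."""
    part_dir = Path(root) / f"{partition_col}={value}"
    rows = []
    for path in sorted(part_dir.glob("*.csv")):
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                row[partition_col] = value  # restore the column from the directory name
                rows.append(dict(row))
    return rows


def demo():
    """Write the sample rows partitioned by year, then read back a single partition."""
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(ROWS, root, "year")
        partition_dirs = sorted(p.name for p in Path(root).iterdir())
        return partition_dirs, read_partition(root, "year", "2023")
```

Calling `demo()` returns the partition directory names and the rows of the 2023 partition; the files under `year=2022` are never opened, which is the effect a query engine exploits when a query filters on the partition column.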
