Lectures related to Data Wrangling with Hadoop: Advanced Techniques

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.

Data Wrangling with Hadoop

Covers data wrangling techniques using Hadoop, focusing on row versus column-oriented databases, popular storage formats, and HBase-Hive integration.

Data Wrangling Techniques: HBase and Hive Integration

Covers data wrangling techniques using HBase and Hive, focusing on integration and practical applications.

Data Wrangling with Hadoop: Storage Formats and Hive

Explores data wrangling with Hadoop, emphasizing storage formats and Hive for big data processing.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Big Data Ecosystems: Technologies and Challenges

Covers the fundamentals of big data ecosystems, focusing on technologies, challenges, and practical exercises with Hadoop's HDFS.

Advanced Spark Optimization Techniques: Managing Big Data

Discusses advanced Spark optimization techniques for managing big data efficiently, focusing on parallelization, shuffle operations, and memory management.

Introduction to Spark Runtime Architecture

Covers the Spark runtime architecture, including RDDs, transformations, actions, and caching for performance optimization.

General Introduction to Data Science

Offers a comprehensive introduction to Data Science, covering Python, Numpy, Pandas, Matplotlib, and Scikit-learn, with a focus on practical exercises and collaborative work.

Handling Data: Data Models and Wrangling

Explores data handling fundamentals, including models, sources, and wrangling, emphasizing the importance of understanding and addressing data problems.

Data Cleaning Challenges: Optimizing Error Detection

Addresses challenges in data cleaning for analysis, proposing optimizations to reduce processing time.

Integrating Scalable Data Storage and Map Reduce Processing with Hadoop

Covers the integration of scalable data storage and map reduce processing using Hadoop, including HDFS, Hive, Parquet, ORC, Spark, and HBase.

Big Data: Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, typical architecture, challenges, and technologies used to address them.

Data Analysis to AI and ML, Social Media

Explores the evolution from data analysis to AI and ML, emphasizing big data, machine learning, and social media interaction.

Introduction to Spark Runtime Architecture

Introduces Apache Spark, covering its architecture, RDDs, transformations, actions, fault tolerance, deployment options, and practical exercises in Jupyter notebooks.

Introduction to Data Stream Processing: Concepts and Applications

Covers the principles of data stream processing and its applications in real-time data analysis.

Big Data Challenges: Scaling to Massive Data

Explores challenges of handling massive data in the era of big data, discussing solutions like MapReduce and Spark.

Data formats and data wrangling with Hadoop

Explores Apache Hive for data warehousing, data formats, and partitioning, with practical exercises in querying and connecting to Hive.

Introduction to Applied Data Analysis

Introduces the Applied Data Analysis course at EPFL, covering a broad range of data analysis topics and emphasizing continuous learning in data science.