Lecture

Temporality and Entity Resolution

In course

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students

Description

This lecture covers the challenges of handling temporality in data, distinguishing the time at which data is entered from the time at which the recorded phenomenon is considered true. It then turns to entity resolution: identifying and merging duplicate entity profiles across datasets. Techniques discussed include fuzzy matching, deduplication, and similarity metrics such as Jaccard similarity. The lecture also examines why duplicate entity detection is difficult and computationally costly, and explains strategies for reducing that cost, such as blocking for candidate selection and q-gram set joins. The session concludes with a summary of entity resolution and data wrangling tasks, emphasizing the importance of optimizations that make clustering more efficient.
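
As a rough illustration of how blocking and Jaccard similarity over q-gram sets can fit together, the following minimal sketch compares only records that share a blocking key instead of comparing all pairs. It is not taken from the lecture material: the function names, the padding character `#`, the blocking key (first letter of the last name token), and the 0.5 threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(text, q=3):
    """Return the set of q-grams of a lowercased string, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def blocking_key(name):
    """Hypothetical blocking key: first letter of the last token.

    Assumes the name contains at least one non-whitespace token.
    """
    return name.split()[-1][0].lower()


def find_duplicates(names, threshold=0.5, q=3):
    """Return (left, right, score) for candidate pairs scoring >= threshold."""
    # Candidate selection: group records by blocking key so that only
    # records inside the same block are ever compared.
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)

    matches = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = jaccard(qgrams(left, q), qgrams(right, q))
            if score >= threshold:
                matches.append((left, right, score))
    return matches


if __name__ == "__main__":
    records = ["John Smith", "Jon Smith", "Jane Doe", "John Smyth"]
    for left, right, score in find_duplicates(records):
        print(f"{left!r} ~ {right!r}: {score:.2f}")
```

With n records split into blocks, the number of comparisons drops from n(n-1)/2 to the sum of the within-block pair counts. The trade-off is that a true duplicate pair whose records land in different blocks is never compared, so the choice of blocking key matters.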

Instructor

Anastasia Ailamaki

Related lectures (32)

Advanced Structure Discovery: Distance Metrics and Time Series Data

Explores clustering algorithms, distance metrics, and time series data analysis techniques.

Data Modeling: Concepts and Applications

Explores data modeling concepts, SQL implementations, and practical applications in handling missing data.

Supervised Learning: k-NN and Decision Trees

Introduces supervised learning with k-NN and decision trees, covering techniques, examples, and ensemble methods.

General Introduction to Data Science

Offers a comprehensive introduction to Data Science, covering Python, Numpy, Pandas, Matplotlib, and Scikit-learn, with a focus on practical exercises and collaborative work.

Supervised Learning Overview

Covers CNNs, RNNs, SVMs, and supervised learning methods, emphasizing the importance of tuning regularization and making informed decisions in machine learning.