This lecture covers the challenges of handling temporality in data, in particular the distinction between the time at which data is entered and the time at which the recorded phenomenon is considered true. It then turns to entity resolution, which involves identifying and merging duplicate entity profiles across datasets. Techniques discussed include fuzzy matching, deduplication, and similarity metrics such as Jaccard similarity. The lecture also examines the computational cost of duplicate detection and explains strategies for reducing it, such as blocking for candidate selection and q-gram set joins. The session concludes with a summary of entity resolution and data wrangling tasks, emphasizing the importance of optimizations that make clustering more efficient.
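To make the combination of blocking and q-gram Jaccard similarity concrete, here is a minimal Python sketch. It is not the lecture's implementation: the function names, the sample records, the blocking key (first letter of the last token), the choice of q = 2, and the 0.5 threshold are all illustrative assumptions. It also compares candidates pairwise inside each block rather than using an index-based set join.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of character q-grams of a lower-cased string without spaces."""
    s = text.lower().replace(" ", "")
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(records, block_key):
    """Blocking: yield index pairs only for records that share a block key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def find_duplicates(records, block_key, threshold=0.5, q=2):
    """Return candidate pairs whose q-gram Jaccard similarity reaches the threshold."""
    grams = [qgrams(r, q) for r in records]  # compute each q-gram set once
    return [
        (i, j)
        for i, j in candidate_pairs(records, block_key)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


# Hypothetical sample data, invented for illustration only.
records = ["Jon Smith", "John Smith", "Jane Doe", "Jan Doe", "Joan Smythe"]


def last_initial(rec):
    """Assumed blocking key: first letter of the last whitespace-separated token."""
    return rec.split()[-1][0].lower()


pairs = find_duplicates(records, block_key=last_initial)
print(pairs)
```

With these five records, comparing every pair would need 10 comparisons, while the assumed blocking key leaves only 4 candidate pairs. The trade-off is that true duplicates landing in different blocks are never compared, so the choice of key matters.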