This lecture covers entity resolution (ER): identifying and aggregating the different entity profiles that refer to the same real-world entity across datasets. Topics include duplicate elimination, record linkage, similarity metrics, data deduplication, and possible repairs. The instructor also discusses the challenges of dealing with duplicate entities, such as name and attribute ambiguity and data-entry errors. Techniques such as clustering, blocking, the q-gram set join, and the ClusterJoin algorithm are explained in detail as ways to detect duplicates and cluster entities efficiently.
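To make the pipeline above concrete, here is a minimal sketch of q-gram based duplicate detection with blocking and clustering. It is an illustration under stated assumptions, not the lecture's own code and not the ClusterJoin algorithm: the function names (`qgrams`, `jaccard`, `resolve`), the choice of Jaccard similarity, the `#` padding character, and the 0.5 threshold are all hypothetical, and blocking is simplified to "records that share at least one q-gram".

```python
from collections import defaultdict
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resolve(records, threshold=0.5, q=2):
    """Cluster record indices that likely refer to the same entity."""
    grams = [qgrams(r, q) for r in records]

    # Blocking (simplified assumption): only records that share at least
    # one q-gram become candidate pairs, avoiding an all-pairs comparison.
    blocks = defaultdict(set)
    for idx, gram_set in enumerate(grams):
        for gram in gram_set:
            blocks[gram].add(idx)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))

    # Clustering: merge matching pairs with union-find, so matches are
    # treated as transitive (a~b and b~c puts a, b, c in one cluster).
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in candidates:
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    names = ["Jon Smith", "John Smith", "Alice Jones"]
    print(resolve(names))
```

With these example names, the first two records share most of their 2-grams and fall into one cluster, while the third stays on its own. A production q-gram set join would typically add filtering (for example on set sizes or prefixes) before verifying candidate pairs; that step is omitted here for brevity.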