A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage

Kevin O'Hare, Anna Jurek-Loughrey, Cassio de Campos

Research output: Chapter in Book/Report/Conference proceeding, Chapter (peer-reviewed)

Abstract

Record linkage, also referred to as entity resolution, is the process of identifying records that represent the same real-world entity (e.g. a person) across varied data sources. To reduce the computational complexity associated with record comparisons, a task referred to as blocking is commonly performed prior to the linkage process. The blocking task involves partitioning records into blocks and treating records from different blocks as not related to the same entity. Following this, record linkage methods are applied within each block, significantly reducing the number of record comparisons. Most existing blocking techniques require some degree of parameter selection in order to optimise performance for a particular dataset (e.g. the attributes and blocking functions used for splitting records into blocks). Optimal parameters can be selected manually, but this is expensive in terms of time and cost and assumes that a domain expert is available. Automatic supervised blocking techniques have been proposed; however, they require a set of labelled data in which the matching status of each record is known. In the majority of real-world scenarios, we do not have any information regarding the matching status of records obtained from multiple sources. Therefore, there is a demand for blocking techniques that sufficiently reduce the number of record comparisons with little to no human input or labelled data required. Given the importance of the problem, recent research efforts have seen the development of novel unsupervised and semi-supervised blocking techniques. In this chapter, we review existing blocking techniques and discuss their advantages and disadvantages. We detail other research areas that have recently arisen and discuss other unresolved issues that are still to be addressed.
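
The partitioning step described in the abstract can be illustrated with a minimal sketch of key-based blocking. This is not taken from the chapter: the record fields (`surname`, `birth_year`), the `blocking_key` function, and the toy records are hypothetical, and for brevity the sketch deduplicates a single list rather than linking several sources. It only shows how restricting comparisons to records that share a block cuts down the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking function: first letter of the surname plus birth year.
    return (record["surname"][:1].lower(), record["birth_year"])


def build_blocks(records, key_fn):
    # Partition records into blocks keyed by the blocking function's output.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    # Only records that share a block are ever compared.
    for members in blocks.values():
        yield from combinations(members, 2)


# Toy data, invented for illustration only.
records = [
    {"id": 1, "surname": "Smith", "birth_year": 1980},
    {"id": 2, "surname": "Smyth", "birth_year": 1980},
    {"id": 3, "surname": "Jones", "birth_year": 1975},
    {"id": 4, "surname": "Jonas", "birth_year": 1975},
    {"id": 5, "surname": "Smith", "birth_year": 1991},
]

blocks = build_blocks(records, blocking_key)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

# Number of comparisons an exhaustive all-pairs approach would need.
total_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs after blocking, {total_pairs} without")
```

The choice of blocking key here is exactly the kind of parameter the abstract says usually has to be tuned per dataset: a key that is too strict separates true matches into different blocks, while a key that is too loose leaves most of the comparisons in place.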
Original language: English
Title of host publication: Linking and Mining Heterogeneous and Multi-view Data
Editors: Deepak P, Anna Jurek-Loughrey
Publisher: Springer
Pages: 79-105
ISBN (Electronic): 978-3-030-01871-9
Publication status: Published - 27 Nov 2018

Publication series

Name: Unsupervised and Semi-Supervised Learning
Publisher: Springer
