Record linkage, referred to also as entity resolution, is a process of identifying records representing the same real-world entity (e.g. a person) across varied data sources. To reduce the computational complexity associated with record comparisons, a task referred to as blocking is commonly performed prior to the linkage process. The blocking task involves partitioning records into blocks of records and treating records from different blocks as not related to the same entity. Following this, record linkage methods are applied within each block significantly reducing the number of record comparisons. Most of the existing blocking techniques require some degree of parameter selection in order to optimise the performance for a particular dataset (e.g. attributes and blocking functions used for splitting records into blocks). Optimal parameters can be selected manually but this is expensive in terms of time and cost and assumes a domain expert to be available. Automatic supervised blocking techniques have been proposed; however, they require a set of labelled data in which the matching status of each record is known. In the majority of real-world scenarios, we do not have any information regarding the matching status of records obtained from multiple sources. Therefore, there is a demand for blocking techniques that sufficiently reduce the number of record comparisons with little to no human input or labelled data required. Given the importance of the problem, recent research efforts have seen the development of novel unsupervised and semi-supervised blocking techniques. In this chapter, we review existing blocking techniques and discuss their advantages and disadvantages. We detail other research areas that have recently arose and discuss other unresolved issues that are still to be addressed.
|Title of host publication
|Linking and Mining Heterogeneous and Multi-view Data
|Deepak P, Anna Jurek-Loughrey
|Published - 27 Nov 2018
|Unsupervised and Semi-Supervised Learning