Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage

Research output: Chapter in Book/Report/Conference proceeding › Chapter (peer-reviewed)


Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security work, where it allows efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. a person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. National Insurance Number), hence records need to be matched by comparing their corresponding attributes. Most existing record linkage methods require assistance from a domain expert to handcraft domain-specific linking rules. More automated approaches based on machine learning have also been proposed. However, those approaches rely on having a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has attracted strong interest in the past decade, and significant progress has been made in this area. In particular, the problem of reducing the manual effort and the amount of labelled data required for constructing record linkage models has been addressed in many studies. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.
Original language: English
Title of host publication: Linking and Mining Heterogeneous and Multi-view Data
Publication status: Published - Nov 2018

