Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage

Research output: Chapter in Book/Report/Conference proceeding › Chapter (peer-reviewed)

Abstract

Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security to allow efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. National Insurance Number), hence records need to be matched by comparing their corresponding attributes. Most of the existing record linkage methods require assistance from a domain expert for handcrafting domain-specific linking rules. More automatic approaches, based on machine learning, have also been proposed. However, those approaches rely on having a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has attracted strong interest in the past decade. As a result, significant progress has been made in this area. In particular, the problem of reducing the manual effort and the amount of labelled data required for constructing record linkage models has been addressed in many studies. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.
Language: English
Title of host publication: Linking and Mining Heterogeneous and Multi-view Data
Publisher: Springer
Pages: 55-78
DOI: https://doi.org/10.1007/978-3-030-01872-6_3
Publication status: Published - Nov 2018

Fingerprint

Data integration
Terrorism
National security
Insurance
Law enforcement
Learning systems
Decision making

Cite this

@incollection{8810981aed95423f996a5487dc4f0d5d,
title = "Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage",
abstract = "Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security to allow efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. National Insurance Number), hence records need to be matched by comparing their corresponding attributes. Most of the existing record linkage methods require assistance from a domain expert for handcrafting domain-specific linking rules. More automatic approaches, based on machine learning, have also been proposed. However, those approaches rely on having a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has attracted strong interest in the past decade. As a result, significant progress has been made in this area. In particular, the problem of reducing the manual effort and the amount of labelled data required for constructing record linkage models has been addressed in many studies. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.",
author = "Anna Jurek-Loughrey and Deepak Padmanabhan",
year = "2018",
month = "11",
doi = "10.1007/978-3-030-01872-6_3",
language = "English",
pages = "55--78",
booktitle = "Linking and Mining Heterogeneous and Multi-view Data",
publisher = "Springer",

}

Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage. / Jurek-Loughrey, Anna; Padmanabhan, Deepak.

Linking and Mining Heterogeneous and Multi-view Data. Springer, 2018. p. 55-78.

TY  - CHAP
T1  - Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage
AU  - Jurek-Loughrey, Anna
AU  - Padmanabhan, Deepak
PY  - 2018/11
Y1  - 2018/11
AB  - Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security to allow efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. National Insurance Number), hence records need to be matched by comparing their corresponding attributes. Most of the existing record linkage methods require assistance from a domain expert for handcrafting domain-specific linking rules. More automatic approaches, based on machine learning, have also been proposed. However, those approaches rely on having a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has attracted strong interest in the past decade. As a result, significant progress has been made in this area. In particular, the problem of reducing the manual effort and the amount of labelled data required for constructing record linkage models has been addressed in many studies. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.
U2  - https://doi.org/10.1007/978-3-030-01872-6_3
DO  - https://doi.org/10.1007/978-3-030-01872-6_3
M3  - Chapter (peer-reviewed)
SP  - 55
EP  - 78
BT  - Linking and Mining Heterogeneous and Multi-view Data
PB  - Springer
ER  - 