An unsupervised blocking technique for more efficient record linkage

Research output: Contribution to journal, Article

Abstract

Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (for example, a person) within a dataset or across multiple datasets. This enables the integration of multi-source data, which in turn supports better knowledge discovery. To reduce the number of record comparisons, record linkage frameworks first perform a process commonly referred to as blocking, which separates records into blocks using a partition (or blocking) scheme. Comparisons during the linkage process are then restricted to records that belong to the same block. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert or learned automatically from labelled data. However, in many real-world situations no such labelled dataset is available. In this paper we propose a novel unsupervised blocking technique for structured datasets that requires neither labelled data nor manual fine-tuning of parameters. Experimental evaluations across a large number of datasets demonstrate that this approach often outperforms both supervised and unsupervised baseline techniques, often in less time.
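To make the idea of blocking concrete, the following is a minimal Python sketch of generic key-based blocking. It is not the unsupervised technique proposed in the paper: the blocking key (first letter of surname plus birth year), the record fields, and all function names are hypothetical, chosen only to show how restricting comparisons to records in the same block reduces the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of surname plus birth year."""
    return (record["surname"][:1].lower(), record["birth_year"])


def build_blocks(records, key_fn=blocking_key):
    """Partition records into blocks; records sharing a key share a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield id pairs, comparing only records that fall in the same block."""
    for members in blocks.values():
        for a, b in combinations(members, 2):
            yield (a["id"], b["id"])


# Illustrative toy data (assumed, not taken from the paper).
records = [
    {"id": 1, "surname": "Smith", "birth_year": 1980},
    {"id": 2, "surname": "Smyth", "birth_year": 1980},
    {"id": 3, "surname": "Jones", "birth_year": 1975},
    {"id": 4, "surname": "Johnson", "birth_year": 1975},
    {"id": 5, "surname": "Brown", "birth_year": 1990},
]

pairs = list(candidate_pairs(build_blocks(records)))
all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison count
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With this toy data, only two candidate pairs remain out of the ten an exhaustive comparison would need. A hand-picked key like this one is exactly the kind of manual choice the abstract says existing techniques often depend on.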
Original language: English
Pages (from-to): 181-195
Number of pages: 15
Journal: Data & Knowledge Engineering
Volume: 122
DOI: 10.1016/j.datak.2019.06.005
Publication status: Published - 08 Jul 2019

Fingerprint

Tuning
Data mining

Cite this

@article{a04e23a9eded4674af0bcbc16733b205,
title = "An unsupervised blocking technique for more efficient record linkage",
abstract = "Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (for example, a person) within a dataset or across multiple datasets. This enables the integration of multi-source data, which in turn supports better knowledge discovery. To reduce the number of record comparisons, record linkage frameworks first perform a process commonly referred to as blocking, which separates records into blocks using a partition (or blocking) scheme. Comparisons during the linkage process are then restricted to records that belong to the same block. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert or learned automatically from labelled data. However, in many real-world situations no such labelled dataset is available. In this paper we propose a novel unsupervised blocking technique for structured datasets that requires neither labelled data nor manual fine-tuning of parameters. Experimental evaluations across a large number of datasets demonstrate that this approach often outperforms both supervised and unsupervised baseline techniques, often in less time.",
author = "O'Hare, Kevin and Jurek-Loughrey, Anna and {de Campos}, Cassio",
year = "2019",
month = "7",
day = "8",
doi = "10.1016/j.datak.2019.06.005",
language = "English",
volume = "122",
pages = "181--195",
journal = "Data \& Knowledge Engineering",
publisher = "Elsevier",
}

An unsupervised blocking technique for more efficient record linkage. / O'Hare, Kevin; Jurek-Loughrey, Anna; de Campos, Cassio.

In: Data & Knowledge Engineering, Vol. 122, 08.07.2019, p. 181-195.

TY  - JOUR
T1  - An unsupervised blocking technique for more efficient record linkage
AU  - O'Hare, Kevin
AU  - Jurek-Loughrey, Anna
AU  - de Campos, Cassio
PY  - 2019/7/8
Y1  - 2019/7/8
AB  - Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (for example, a person) within a dataset or across multiple datasets. This enables the integration of multi-source data, which in turn supports better knowledge discovery. To reduce the number of record comparisons, record linkage frameworks first perform a process commonly referred to as blocking, which separates records into blocks using a partition (or blocking) scheme. Comparisons during the linkage process are then restricted to records that belong to the same block. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert or learned automatically from labelled data. However, in many real-world situations no such labelled dataset is available. In this paper we propose a novel unsupervised blocking technique for structured datasets that requires neither labelled data nor manual fine-tuning of parameters. Experimental evaluations across a large number of datasets demonstrate that this approach often outperforms both supervised and unsupervised baseline techniques, often in less time.
U2  - 10.1016/j.datak.2019.06.005
DO  - 10.1016/j.datak.2019.06.005
M3  - Article
VL  - 122
SP  - 181
EP  - 195
JO  - Data & Knowledge Engineering
JF  - Data & Knowledge Engineering
ER  -