High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

Kevin O'Hare, Anna Jurek-Loughrey, Cassio Pires, Cassio de Campos

Research output: Contribution to journal › Article › peer-review



Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
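The abstract does not spell out the HVTB algorithm, so the following is only a minimal sketch of the general idea it names: schema-agnostic token blocking in which each record keeps just its highest-scoring TF-IDF tokens, and records sharing such a token fall into the same block. The function name `high_value_token_blocks`, the `top_k` parameter, whitespace tokenisation, and the particular TF-IDF weighting are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def high_value_token_blocks(records, top_k=3):
    """Illustrative sketch only; not the authors' exact method.

    records: dict mapping record id -> free text (all attributes concatenated,
             so no schema knowledge is needed).
    Returns a dict mapping token -> set of record ids that share that token
    among their top_k highest TF-IDF tokens. Only blocks with at least two
    records are kept, since singleton blocks yield no comparisons.
    """
    # Assumed tokenisation: lowercase and split on whitespace.
    tokenized = {rid: text.lower().split() for rid, text in records.items()}
    n = len(tokenized)

    # Document frequency: number of records containing each token.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    blocks = defaultdict(set)
    for rid, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            t: (count / len(tokens)) * math.log(n / df[t])
            for t, count in tf.items()
        }
        # Rank by descending score, breaking ties alphabetically so the
        # result is deterministic; drop tokens that occur in every record
        # (score 0), as they carry no discriminating power.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        for t in [t for t in ranked if scores[t] > 0][:top_k]:
            blocks[t].add(rid)

    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


example = {
    "a": "john smith belfast",
    "b": "jon smith belfast uk",
    "c": "mary jones belfast",
}
print(high_value_token_blocks(example, top_k=3))
```

In this toy input, "belfast" appears in every record and therefore never forms a block, while the rarer shared token "smith" places records "a" and "b" together as candidates for comparison.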
Original language: English
Article number: 24
Number of pages: 17
Journal: ACM Transactions on Knowledge Discovery from Data
Issue number: 2
Publication status: Published - 01 Jul 2021

