Deep Learning Based Approach to Unstructured Record Linkage

Research output: Contribution to journal, Article, peer-review


Abstract

Purpose
In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is the task of identifying and linking records from multiple sources that describe the same real-world object (e.g. a person), and it plays a crucial role in the data integration process. RL is challenging because different data sources rarely share a unique identifier. Hence, records must be matched by comparing their corresponding values. Most existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing number of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.
Methodology
In our previous work [16] we proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that our method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands our previous work, originally presented at iiWAS2020 [16], by exploring new architectures of the Siamese Neural Network, which improve the generalisation of the RL model and make it less sensitive to parameter selection.
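The general idea of a Siamese network for record matching can be illustrated with a minimal sketch. It is not the architecture from [16]: the trigram hashing in `text_to_vector`, the single tanh layer, the layer sizes, the Euclidean distance, and the example records are all assumptions chosen for illustration, and the weights are random rather than trained. The sketch only shows the defining property, namely that both records pass through the same encoder with shared weights before being compared.

```python
import math
import random
import zlib


def text_to_vector(text, dim=64):
    """Hash character trigrams of a free-text record into a fixed-size count vector.

    Hypothetical input representation; the paper's own features may differ.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    return vec


class SiameseMLP:
    """One shared hidden layer applied to both records (untrained, illustrative)."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(hidden_dim)
        ]
        self.biases = [0.0] * hidden_dim

    def encode(self, x):
        # The same weights are used for every record: this is the "Siamese" part.
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weights, self.biases)
        ]

    def distance(self, a, b):
        # Euclidean distance between the two embeddings; a trained model would
        # place matching records close together and non-matching ones far apart.
        ea, eb = self.encode(a), self.encode(b)
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))


model = SiameseMLP(in_dim=64, hidden_dim=8)
rec_a = text_to_vector("John Smith, 12 High Street, Belfast")   # made-up record
rec_b = text_to_vector("Smith John 12 High St Belfast")         # made-up record
print(model.distance(rec_a, rec_b))
```

Because the weights are random, the printed distance carries no meaning by itself. In practice the shared encoder would be trained on labelled matching and non-matching pairs, and a threshold on the distance would decide whether two records are linked.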
Findings
The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in [16]. Better results were achieved on three of the four datasets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of parameter selection.
Originality
To address the problem of unstructured RL, this paper presents a new deep learning based approach to improve the generalisation of the Siamese Multilayer Perceptron model and make it less sensitive to parameter selection.
Original language: English
Journal: International Journal of Web Information Systems
Early online date: 18 Oct 2021
Publication status: Published - 01 Dec 2021
