Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (for example, a person) within a dataset or across multiple datasets. It enables the integration of multi-source data, which in turn supports better knowledge discovery. To reduce the number of record comparisons, record linkage frameworks first perform a process commonly referred to as blocking, which separates records into blocks using a partition (or blocking) scheme. During the linkage process, comparisons are then restricted to records that belong to the same block. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert or learned automatically from labelled data. However, in many real-world situations no such labelled dataset is available. In this paper we propose a novel unsupervised blocking technique for structured datasets that requires neither labelled data nor manual fine-tuning of parameters. Experimental evaluations across a large number of datasets demonstrate that this approach often achieves better performance than both supervised and unsupervised baseline techniques, often in less time.
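To make the idea of blocking concrete, the following is a minimal sketch of conventional key-based blocking. It is not the unsupervised technique proposed in the thesis. The record fields (`surname`, `birth_year`) and the helper names (`blocking_key`, `build_blocks`, `candidate_pairs`) are hypothetical and chosen only for illustration. Records that share a blocking key fall into the same block, and only pairs within a block are passed on for comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking scheme: first letter of the surname plus birth year.
    return (record["surname"][:1].lower(), record["birth_year"])


def build_blocks(records, key_fn):
    # Group record ids by their blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["id"])
    return dict(blocks)


def candidate_pairs(blocks):
    # Only records inside the same block are paired for comparison.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy data, invented for this illustration.
records = [
    {"id": 1, "surname": "Smith", "birth_year": 1980},
    {"id": 2, "surname": "Smyth", "birth_year": 1980},
    {"id": 3, "surname": "Jones", "birth_year": 1975},
    {"id": 4, "surname": "Johnson", "birth_year": 1975},
    {"id": 5, "surname": "Brown", "birth_year": 1990},
]

blocks = build_blocks(records, blocking_key)
pairs = candidate_pairs(blocks)

# Number of comparisons an exhaustive all-pairs linkage would need.
total_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total_pairs}")
```

In this sketch the choice of key is exactly the kind of parameter that normally has to be tuned by hand or learned from labelled data: a key that is too strict separates true matches into different blocks, while a key that is too loose leaves most comparisons in place.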
Supervisor: Jurek-Loughrey, A. (Supervisor) & de Campos, C. (Supervisor)
Student thesis: Doctoral Thesis › Doctor of Philosophy