Making it tractable to detect and correct errors in graphs

Wenfei Fan, Kehan Pang, Ping Lu, Chao Tian

Research output: Contribution to journalArticlepeer-review

3 Citations (Scopus)

Abstract

This article develops Hercules, a system for entity resolution (ER), conflict resolution (CR), timeliness deduction (TD), and missing value/link imputation (MI) in graphs. It proposes GCR+s, a class of graph cleaning rules (GCR) that support not only predicates for ER and CR but also temporal orders to deduce timeliness and data extraction to impute missing data. As opposed to previous graph rules, GCR+s are defined with a dual graph pattern to accommodate irregular structures of schemaless graphs and adopt patterns of a star form to reduce the complexity. We show that while the implication and satisfiability problems are intractable for GCR+s, it is in polynomial time to detect and correct errors with GCR+s. Underlying Hercules, we train a ranking model to predict the temporal orders on attributes and embed it as a predicate of GCR+s. We provide an algorithm for discovering GCR+s by combining the generations of patterns and predicates. We also develop a method for conducting ER, CR, TD, and MI in the same process to improve the overall quality of graphs by leveraging their interactions and chasing with GCR+s; we show that the method has the Church–Rosser property under certain conditions. Using real-life and synthetic graphs, we empirically verify that Hercules is 53% more accurate than the state-of-the-art graph cleaning systems and performs comparably in efficiency and scalability.

Original languageEnglish
Article number16
Pages (from-to)16:1-16:75
Number of pages75
JournalACM Transactions on Database Systems
Volume49
Issue number4
DOIs
Publication statusPublished - 16 Dec 2024
Externally publishedYes

Keywords

  • entity resolution
  • conflict resolution
  • graph cleaning rules

Fingerprint

Dive into the research topics of 'Making it tractable to detect and correct errors in graphs'. Together they form a unique fingerprint.

Cite this