Visually extracting data records from the deep web

Neil Anderson, Jun Hong

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

12 Citations (Scopus)

Abstract

Web sites that rely on databases for their content are now ubiquitous. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. Furthermore, it establishes the case for use of vision-based algorithms in the context of data extraction from web sites.
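
The idea of using visual similarity to separate data records from noise items can be illustrated with a minimal sketch. This is not rExtractor's algorithm, which the abstract does not specify. The `Block` type, the pixel tolerance `tol`, and the rule of comparing only left edge and width are all assumptions made for illustration. The sketch supposes that a browser has already rendered the page and reported a bounding box for each candidate block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered page block (hypothetical type): pixel geometry plus text."""
    left: int
    top: int
    width: int
    height: int
    text: str


def visually_similar(a: Block, b: Block, tol: int = 5) -> bool:
    """Assumed similarity test: left edge and width agree within tol pixels."""
    return abs(a.left - b.left) <= tol and abs(a.width - b.width) <= tol


def largest_similar_group(blocks: list[Block], tol: int = 5) -> list[Block]:
    """Greedily cluster blocks in top-to-bottom order and return the largest
    cluster, on the assumption that data records repeat while noise does not."""
    groups: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.top):
        for group in groups:
            # Compare against the first member of each existing cluster.
            if visually_similar(group[0], block, tol):
                group.append(block)
                break
        else:
            groups.append([block])
    return max(groups, key=len) if groups else []


# Usage: three aligned result blocks, one navigation bar, one advert.
page = [
    Block(0, 0, 960, 40, "Home | Search | Help"),
    Block(120, 60, 600, 80, "Record A"),
    Block(120, 150, 600, 82, "Record B"),
    Block(740, 60, 200, 300, "Advert"),
    Block(120, 242, 600, 80, "Record C"),
]
records = largest_similar_group(page)
print([b.text for b in records])  # ['Record A', 'Record B', 'Record C']
```

A fuller treatment along the lines the abstract describes would also weigh structural regularity and content similarity. This sketch uses geometry alone to keep the example short.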
Original language: English
Title of host publication: WWW '13 Companion Proceedings of the 22nd International Conference on World Wide Web
Publisher: Association for Computing Machinery
Pages: 1233-1238
Number of pages: 6
ISBN (Print): 978-1-4503-2038-2
Publication status: Published - 2013
Event: 22nd International World Wide Web Conference - Windsor Barra Hotel, Rio de Janeiro, Brazil
Duration: 13 May 2013 - 17 May 2013

Conference

Conference: 22nd International World Wide Web Conference
Country/Territory: Brazil
City: Rio de Janeiro
Period: 13/05/2013 - 17/05/2013

ASJC Scopus subject areas

  • Computer Networks and Communications
