Web Data Extraction from Query Result Pages Based on Visual and Content Features

Daiyue Weng, Jun Hong, David Bell

Research output: Contribution to journal (Article)

Abstract

A rapidly increasing number of Web databases are now accessible via
their HTML form-based query interfaces. Query result pages, which encode structured data
and are displayed for human use, are dynamically generated in response to user queries.
They usually contain other types of information in addition to query results, e.g.,
advertisements and navigation bars. The problem of extracting structured data
from query result pages is critical for Web data integration applications, such as comparison
shopping and meta-search engines, and has been intensively studied. A number of approaches
have been proposed. As the structures of Web pages become more and more complex, the
existing approaches start to fail, and most of them do not remove irrelevant content, which
may affect the accuracy of data record extraction. We propose an automated approach for
Web data extraction. First, it makes use of visual features and query terms to identify data
sections and extracts data records in these sections. We also represent several content and
visual features of visual blocks in a data section, and use them to filter out noisy blocks.
Second, it measures similarity between data items in different data records based on their
visual and content features, and aligns them into different groups so that the data in the
same group have the same semantics. The results of our experiments with a large set of
Web query result pages in different domains show that our proposed approaches are highly
effective.
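
The alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the feature set, the similarity measure, or the grouping procedure, so everything below is an assumption made for illustration. The DataItem fields (left, font), the shape() pattern used as a content feature, the equal weighting, the 0.7 threshold, and the greedy grouping are all hypothetical choices.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
import re


@dataclass(frozen=True)
class DataItem:
    text: str   # content of the data item
    left: int   # assumed visual feature: x offset (pixels) inside its record
    font: str   # assumed visual feature: font description


def shape(text: str) -> str:
    """Reduce text to a coarse pattern: digit runs -> '9', letter runs -> 'a'."""
    collapsed = re.sub(r"[0-9]+", "9", text)
    return re.sub(r"[A-Za-z]+", "a", collapsed)


def content_similarity(a: DataItem, b: DataItem) -> float:
    # Compare coarse patterns, so "$29.99" and "$45.00" count as identical.
    return SequenceMatcher(None, shape(a.text), shape(b.text)).ratio()


def visual_similarity(a: DataItem, b: DataItem, max_offset: int = 200) -> float:
    font_score = 1.0 if a.font == b.font else 0.0
    pos_score = max(0.0, 1.0 - abs(a.left - b.left) / max_offset)
    return 0.5 * font_score + 0.5 * pos_score


def similarity(a: DataItem, b: DataItem, w_visual: float = 0.5) -> float:
    # Hypothetical weighting of visual against content evidence.
    return w_visual * visual_similarity(a, b) + (1 - w_visual) * content_similarity(a, b)


def align(records, threshold: float = 0.7):
    """Greedily place each item in the most similar group, or open a new group."""
    groups = []
    for record in records:
        used = set()  # a group receives at most one item from each record
        for item in record:
            best, best_score = None, threshold
            for i, group in enumerate(groups):
                if i in used:
                    continue
                score = similarity(item, group[0])  # compare to the group's first item
                if score >= best_score:
                    best, best_score = i, score
            if best is None:
                groups.append([item])
                best = len(groups) - 1
            else:
                groups[best].append(item)
            used.add(best)
    return groups


records = [
    [DataItem("Learning Python", 0, "bold"), DataItem("$29.99", 300, "plain")],
    [DataItem("Web Data Mining", 0, "bold"), DataItem("$45.00", 300, "plain")],
]
for group in align(records):
    print([item.text for item in group])
```

With these two toy records, the titles fall into one group and the prices into another, which is the sense in which items in the same group share the same semantics. A real system would need richer features and a more careful alignment than this greedy pass.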
Original language: English
Pages (from-to): 453-472
Number of pages: 20
Journal: International Journal of Software and Informatics
Volume: 6
Issue number: 3
Publication status: Published - 2012
