Abstract
We address the problem of mining interesting phrases from
subsets of a text corpus where the subset is specified using
a set of features such as keywords that form a query. Previous
algorithms for the problem sift through either a phrase
dictionary based index or a document-based index, at a cost
that is linear in either the phrase dictionary size or the size
of the document subset. We propose the use of an independence
assumption between query keywords given the top correlated
phrases, under which the pre-processing can be reduced
to discovering phrases from among the top phrases for each
feature in the query. We then outline an indexing mechanism
where per-keyword phrase lists are stored either on disk
or in memory, so that popular aggregation algorithms such as
No Random Access and Sort-merge Join may be adapted
to do the scoring in real time to identify the top interesting
phrases. Though such an approach is expected to be approximate,
we empirically illustrate that very high accuracies (of
over 90%) are achieved against the results of exact algorithms.
Due to the simplified list-aggregation, we are also
able to provide response times that are orders of magnitude
better than state-of-the-art algorithms. Interestingly, our
disk-based approach outperforms the in-memory baselines
by a factor of a hundred, and sometimes more, confirming
the superiority of the proposed method.
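
The list-aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the scoring function, so the sketch assumes each keyword has a phrase list of (phrase, score) pairs sorted by phrase, that scores lie in (0, 1], and that the independence assumption lets per-keyword scores be combined by summing their logarithms. The names `top_k_phrases` and `example_index`, and all the scores, are hypothetical. The join is a plain multiway sort-merge; the early termination of No Random Access is not shown.

```python
import heapq
import math
from itertools import groupby
from operator import itemgetter


def top_k_phrases(index, query, k=2):
    """Return the k best phrases for a keyword query via a sort-merge join.

    index: dict mapping keyword -> list of (phrase, score), sorted by phrase.
    Assumption: a phrase qualifies only if it occurs in every keyword's list,
    and its aggregate score is the sum of the logs of its per-keyword scores.
    """
    keywords = list(dict.fromkeys(query))  # drop duplicate keywords, keep order
    lists = [index.get(kw, []) for kw in keywords]

    # Merge the sorted lists into one stream that is ordered by phrase.
    merged = heapq.merge(*lists, key=itemgetter(0))

    scored = []
    for phrase, group in groupby(merged, key=itemgetter(0)):
        scores = [score for _, score in group]
        if len(scores) == len(keywords):  # phrase present in every list
            scored.append((sum(math.log(s) for s in scores), phrase))

    return [phrase for _, phrase in heapq.nlargest(k, scored)]


# Hypothetical per-keyword phrase lists, each sorted by phrase.
example_index = {
    "database": [("index structure", 0.4), ("query plan", 0.6), ("top k", 0.2)],
    "ranking": [("learning to rank", 0.5), ("query plan", 0.1), ("top k", 0.7)],
}

print(top_k_phrases(example_index, ["database", "ranking"], k=2))
```

Because only the lists of the query keywords are read, the work in this sketch depends on the lengths of those lists rather than on the phrase dictionary size or the size of the document subset.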
Original language | English |
---|---|
Title of host publication | Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings |
Editors | Sihem Amer-Yahia, et al. |
Publisher | Open Proceedings |
Pages | 193-204 |
Number of pages | 12 |
ISBN (Print) | 9783893180653 |
Publication status | Published - 2014 |
Event | EDBT 2014, Athens, Greece. Duration: 24 Mar 2014 → 28 Mar 2014 |
Conference
Conference | EDBT 2014 |
---|---|
Country/Territory | Greece |
City | Athens |
Period | 24/03/2014 → 28/03/2014 |