Fast Mining of Interesting Phrases from Subsets of Text Corpora

Deepak Padmanabhan, Atreyee Dey, Debapriyo Majumdar

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Citations (Scopus)

Abstract

We address the problem of mining interesting phrases from subsets of a text corpus where the subset is specified using a set of features such as keywords that form a query. Previous algorithms for the problem have proposed solutions that involve sifting through a phrase dictionary based index or a document-based index where the solution is linear in either the phrase dictionary size or the size of the document subset. We propose the usage of an independence assumption between query keywords given the top correlated phrases, wherein the pre-processing could be reduced to discovering phrases from among the top phrases per each feature in the query. We then outline an indexing mechanism where per-keyword phrase lists are stored either in disk or memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join may be adapted to do the scoring at real-time to identify the top interesting phrases. Though such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90%) are achieved against the results of exact algorithms. Due to the simplified list-aggregation, we are also able to provide response times that are orders of magnitude better than state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to hundred times and sometimes more, confirming the superiority of the proposed method.
LanguageEnglish
Title of host publicationAdvances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings
EditorsSihem Amer-Yahia, Et al.
PublisherOpen Proceedings
Pages193-204
Number of pages12
ISBN (Print)9783893180653
DOIs
Publication statusPublished - 2014
EventEDBT 2014 - Greece, Athens, Greece
Duration: 24 Mar 201428 Mar 2014

Conference

ConferenceEDBT 2014
CountryGreece
CityAthens
Period24/03/201428/03/2014

Fingerprint

Glossaries
Agglomeration
Data storage equipment
Processing

Cite this

Padmanabhan, D., Dey, A., & Majumdar, D. (2014). Fast Mining of Interesting Phrases from Subsets of Text Corpora. In S. Amer-Yahia, & E. al. (Eds.), Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings (pp. 193-204). Open Proceedings. https://doi.org/10.5441/002/edbt.2014.18
Padmanabhan, Deepak ; Dey, Atreyee ; Majumdar, Debapriyo. / Fast Mining of Interesting Phrases from Subsets of Text Corpora. Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings. editor / Sihem Amer-Yahia ; Et al. Open Proceedings, 2014. pp. 193-204
@inproceedings{0e98c4dc7ec5471988cfdec1340ece5d,
title = "Fast Mining of Interesting Phrases from Subsets of Text Corpora",
abstract = "We address the problem of mining interesting phrases from subsets of a text corpus where the subset is specified using a set of features such as keywords that form a query. Previous algorithms for the problem have proposed solutions that involve sifting through a phrase dictionary based index or a document-based index where the solution is linear in either the phrase dictionary size or the size of the document subset. We propose the usage of an independence assumption between query keywords given the top correlated phrases, wherein the pre-processing could be reduced to discovering phrases from among the top phrases per each feature in the query. We then outline an indexing mechanism where per-keyword phrase lists are stored either in disk or memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join may be adapted to do the scoring at real-time to identify the top interesting phrases. Though such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90{\%}) are achieved against the results of exact algorithms. Due to the simplified list-aggregation, we are also able to provide response times that are orders of magnitude better than state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to hundred times and sometimes more, confirming the superiority of the proposed method.",
author = "Deepak Padmanabhan and Atreyee Dey and Debapriyo Majumdar",
year = "2014",
doi = "10.5441/002/edbt.2014.18",
language = "English",
isbn = "9783893180653",
pages = "193--204",
editor = "Sihem Amer-Yahia and Et al.",
booktitle = "Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings",
publisher = "Open Proceedings",

}

Padmanabhan, D, Dey, A & Majumdar, D 2014, Fast Mining of Interesting Phrases from Subsets of Text Corpora. in S Amer-Yahia & E al. (eds), Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings. Open Proceedings, pp. 193-204, EDBT 2014, Athens, Greece, 24/03/2014. https://doi.org/10.5441/002/edbt.2014.18

Fast Mining of Interesting Phrases from Subsets of Text Corpora. / Padmanabhan, Deepak; Dey, Atreyee; Majumdar, Debapriyo.

Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings. ed. / Sihem Amer-Yahia; Et al. Open Proceedings, 2014. p. 193-204.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Fast Mining of Interesting Phrases from Subsets of Text Corpora

AU - Padmanabhan, Deepak

AU - Dey, Atreyee

AU - Majumdar, Debapriyo

PY - 2014

Y1 - 2014

N2 - We address the problem of mining interesting phrases from subsets of a text corpus where the subset is specified using a set of features such as keywords that form a query. Previous algorithms for the problem have proposed solutions that involve sifting through a phrase dictionary based index or a document-based index where the solution is linear in either the phrase dictionary size or the size of the document subset. We propose the usage of an independence assumption between query keywords given the top correlated phrases, wherein the pre-processing could be reduced to discovering phrases from among the top phrases per each feature in the query. We then outline an indexing mechanism where per-keyword phrase lists are stored either in disk or memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join may be adapted to do the scoring at real-time to identify the top interesting phrases. Though such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90%) are achieved against the results of exact algorithms. Due to the simplified list-aggregation, we are also able to provide response times that are orders of magnitude better than state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to hundred times and sometimes more, confirming the superiority of the proposed method.

AB - We address the problem of mining interesting phrases from subsets of a text corpus where the subset is specified using a set of features such as keywords that form a query. Previous algorithms for the problem have proposed solutions that involve sifting through a phrase dictionary based index or a document-based index where the solution is linear in either the phrase dictionary size or the size of the document subset. We propose the usage of an independence assumption between query keywords given the top correlated phrases, wherein the pre-processing could be reduced to discovering phrases from among the top phrases per each feature in the query. We then outline an indexing mechanism where per-keyword phrase lists are stored either in disk or memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join may be adapted to do the scoring at real-time to identify the top interesting phrases. Though such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90%) are achieved against the results of exact algorithms. Due to the simplified list-aggregation, we are also able to provide response times that are orders of magnitude better than state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to hundred times and sometimes more, confirming the superiority of the proposed method.

U2 - 10.5441/002/edbt.2014.18

DO - 10.5441/002/edbt.2014.18

M3 - Conference contribution

SN - 9783893180653

SP - 193

EP - 204

BT - Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings

A2 - Amer-Yahia, Sihem

A2 - al., Et

PB - Open Proceedings

ER -

Padmanabhan D, Dey A, Majumdar D. Fast Mining of Interesting Phrases from Subsets of Text Corpora. In Amer-Yahia S, al. E, editors, Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology Athens, Greece, March 24-28, 2014 Proceedings. Open Proceedings. 2014. p. 193-204 https://doi.org/10.5441/002/edbt.2014.18