Improving Recall of Regular Expressions for Information Extraction

Karin Murthy, Deepak Padmanabhan, Prasad Deshpande

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Citations (Scopus)

Abstract

Learning or writing regular expressions to identify instances of a specific
concept within text documents with a high precision and recall is challenging.
It is relatively easy to improve the precision of an initial regular expression
by identifying false positives covered and tweaking the expression to avoid the
false positives. However, modifying the expression to improve recall is difficult
since false negatives can only be identified by manually analyzing all documents,
in the absence of any tools to identify the missing instances. We focus on partially
automating the discovery of missing instances by soliciting minimal user
feedback. We present a technique to identify good generalizations of a regular
expression that have improved recall while retaining high precision. We empirically
demonstrate the effectiveness of the proposed technique as compared to
existing methods and show results for a variety of tasks such as identification of
dates, phone numbers, product names, and course numbers on real-world datasets.
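
The precision/recall trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' technique. It only measures how a hand-written generalization of a regular expression changes precision and recall on a toy text. The text, the gold set, the three patterns, and the helper `precision_recall` are all hypothetical and chosen for illustration.

```python
import re


def precision_recall(pattern, text, gold):
    """Return (precision, recall) of the regex matches against a gold set of strings."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy document and gold phone numbers (not from the paper's datasets).
TEXT = "Call 555-1234 or 555 9876. Fax: 555.4321. Order id 123-45678."
GOLD = {"555-1234", "555 9876", "555.4321"}

# Initial expression: precise, but it only accepts a hyphen as separator.
INITIAL = r"\b\d{3}-\d{4}\b"
# Generalization: the separator is widened to a small character class.
GENERALIZED = r"\b\d{3}[-. ]\d{4}\b"
# Over-generalization: any non-digit separator and no word boundaries.
TOO_GENERAL = r"\d{3}\D\d{4}"

if __name__ == "__main__":
    for name, pattern in [("initial", INITIAL),
                          ("generalized", GENERALIZED),
                          ("too general", TOO_GENERAL)]:
        p, r = precision_recall(pattern, TEXT, GOLD)
        print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

On this toy input the initial pattern misses two of the three numbers, the moderate generalization recovers them without new false positives, and the over-generalized pattern also matches part of the order id. The missed instances are visible here only because the gold set is known in advance, which is exactly the information that is unavailable without manually analyzing the documents.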
Original language: English
Title of host publication: Web Information Systems Engineering - WISE 2012 - 13th International Conference, Paphos, Cyprus, November 28-30, 2012. Proceedings.
Pages: 455-467
Number of pages: 13
Publication status: Published - 2012
Event: WISE 2012 - Cyprus, Paphos, Cyprus
Duration: 28 Nov 2012 to 30 Nov 2012

Conference

Conference: WISE 2012
Country: Cyprus
City: Paphos
Period: 28/11/2012 to 30/11/2012

Cite this

Murthy, K., Padmanabhan, D., & Deshpande, P. (2012). Improving Recall of Regular Expressions for Information Extraction. In Web Information Systems Engineering - WISE 2012 - 13th International Conference, Paphos, Cyprus, November 28-30, 2012. Proceedings. (pp. 455-467) https://sites.google.com/site/deepakp7/publications/wise.pdf?attredirects=0