Improving Recall of Regular Expressions for Information Extraction

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



    Learning or writing regular expressions to identify instances of a specific
    concept within text documents with a high precision and recall is challenging.
    It is relatively easy to improve the precision of an initial regular expression
    by identifying the false positives it covers and tweaking the expression to
    exclude them. However, modifying the expression to improve recall is difficult:
    in the absence of tools to identify missing instances, false negatives can only
    be found by manually analyzing all documents. We focus on partially
    automating the discovery of missing instances by soliciting minimal user
    feedback. We present a technique to identify good generalizations of a regular
    expression that have improved recall while retaining high precision. We empirically
    demonstrate the effectiveness of the proposed technique as compared to
    existing methods and show results for a variety of tasks such as identification of
    dates, phone numbers, product names, and course numbers on real-world datasets.
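
    The precision/recall trade-off described in the abstract can be illustrated
    with a small sketch. This is not the paper's technique, which selects
    generalizations with the help of user feedback; it only measures how a
    hand-written generalization of a date expression changes recall and precision
    on a toy example. The text, the gold instances, the three patterns, and the
    helper `precision_recall` are all assumptions made for illustration.

```python
import re


def precision_recall(pattern, text, gold):
    """Return (precision, recall) of `pattern` on `text` against the gold set.

    Hypothetical helper for illustration; matches are compared as exact strings.
    """
    found = set(re.findall(pattern, text))
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall


# Toy document and hand-labelled date instances (illustrative data only).
TEXT = "Due 03/04/2012, moved to 5-6-2012; ref 12/2012; shipped 11/30/2012."
GOLD = {"03/04/2012", "5-6-2012", "11/30/2012"}

# Initial expression: precise, but misses the "5-6-2012" variant.
INITIAL = r"\b\d{2}/\d{2}/\d{4}\b"
# A generalization: allows one-digit fields and '-' as a separator.
GENERALIZED = r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"
# An over-generalization: the year part is optional, so "12/2012" also matches.
TOO_GENERAL = r"\b\d{1,2}[/-]\d{1,4}(?:[/-]\d{4})?\b"

if __name__ == "__main__":
    for name, pattern in [
        ("initial", INITIAL),
        ("generalized", GENERALIZED),
        ("too general", TOO_GENERAL),
    ]:
        p, r = precision_recall(pattern, TEXT, GOLD)
        print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

    On this toy input the initial expression finds two of the three dates, the
    generalization finds all three without adding false positives, and the
    over-generalization reaches full recall at the cost of one false positive.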
    Original language: English
    Title of host publication: Web Information Systems Engineering - WISE 2012 - 13th International Conference, Paphos, Cyprus, November 28-30, 2012. Proceedings.
    Number of pages: 13
    Publication status: Published - 2012
    Event: WISE 2012 - Cyprus, Paphos, Cyprus
    Duration: 28 Nov 2012 to 30 Nov 2012


    Conference: WISE 2012

    ID: 17867641