Abstract
Learning or writing regular expressions to identify instances of a specific
concept within text documents with a high precision and recall is challenging.
It is relatively easy to improve the precision of an initial regular expression
by identifying false positives covered and tweaking the expression to avoid the
false positives. However, modifying the expression to improve recall is difficult
since false negatives can only be identified by manually analyzing all documents,
in the absence of any tools to identify the missing instances. We focus on partially
automating the discovery of missing instances by soliciting minimal user
feedback. We present a technique to identify good generalizations of a regular
expression that have improved recall while retaining high precision. We empirically
demonstrate the effectiveness of the proposed technique as compared to
existing methods and show results for a variety of tasks such as identification of
dates, phone numbers, product names, and course numbers on real world datasets
concept within text documents with a high precision and recall is challenging.
It is relatively easy to improve the precision of an initial regular expression
by identifying false positives covered and tweaking the expression to avoid the
false positives. However, modifying the expression to improve recall is difficult
since false negatives can only be identified by manually analyzing all documents,
in the absence of any tools to identify the missing instances. We focus on partially
automating the discovery of missing instances by soliciting minimal user
feedback. We present a technique to identify good generalizations of a regular
expression that have improved recall while retaining high precision. We empirically
demonstrate the effectiveness of the proposed technique as compared to
existing methods and show results for a variety of tasks such as identification of
dates, phone numbers, product names, and course numbers on real world datasets
Original language | English |
---|---|
Title of host publication | Web Information Systems Engineering - WISE 2012 - 13th International Conference, Paphos, Cyprus, November 28-30, 2012. Proceedings. |
Pages | 455-467 |
Number of pages | 13 |
Publication status | Published - 2012 |
Event | WISE 2012 - Cyprus, Paphos, Cyprus Duration: 28 Nov 2012 → 30 Nov 2012 |
Conference
Conference | WISE 2012 |
---|---|
Country/Territory | Cyprus |
City | Paphos |
Period | 28/11/2012 → 30/11/2012 |