Abstract
This paper presents preliminary results from applying Brill's Transformation-Based Learning (TBL) approach to the disambiguation problem for the Urdu language. In recent years, considerable work has been done on European and South Asian languages, but comparatively little effort has been devoted to Urdu. With this in mind, this study presents a Part of Speech (POS) tagger for Urdu built with a data-driven approach, Brill's Transformation-Based Learning (TBL). The method automatically deduces rules from a training corpus, achieves accuracy comparable to other statistical techniques, and offers significant advantages over other tagging approaches. The POS tagger is trained on an Urdu corpus; Urdu, in contrast to English, is a free word order language with inflectional characteristics and complex morphology. The corpus consists of 123,775 tokens annotated with a tagset of 36 tags. The proposed POS tagger achieved an accuracy of around 84%. Precision, recall and F-measure have been calculated for the complete test corpus. An error analysis (confusion matrix) for the most frequently confused tag pairs is also presented, along with a brief overview of the Urdu language and Urdu tagging examples that illustrate the model. The performance of the proposed tagger is compared with that of an N-gram POS tagger, and the proposed transformation-based method outperforms the N-gram based tagger.
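To make the idea of deducing rules from a training corpus concrete, the following is a minimal sketch of transformation-based learning, not the paper's implementation. It assumes a single rule template ("change tag a to b when the previous tag is c"), a most-frequent-tag baseline, and a tiny hypothetical English toy corpus; the function names, the tags, and the data are all illustrative assumptions and do not come from the paper's Urdu corpus or its tagset.

```python
from collections import Counter, defaultdict

def baseline_tagger(train):
    """Return a function mapping a word to its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag in sent:
            counts[word][tag] += 1
    all_tags = Counter(tag for sent in train for _, tag in sent)
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word, default)

def apply_rule(tags, rule):
    """Rule (a, b, c): change tag a to b when the previous tag is c."""
    a, b, c = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == c:
            out[i] = b
    return out

def learn_rules(train, max_rules=10):
    """Greedily pick the rule with the largest net error reduction."""
    base = baseline_tagger(train)
    gold = [[t for _, t in sent] for sent in train]
    current = [[base(w) for w, _ in sent] for sent in train]
    rules = []
    for _ in range(max_rules):
        fixes, good = Counter(), Counter()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] == ref[i]:
                    # A rule firing here would break a correct tag.
                    good[(cur[i], cur[i - 1])] += 1
                else:
                    # A rule (current, gold, previous) would fix this error.
                    fixes[(cur[i], ref[i], cur[i - 1])] += 1
        if not fixes:
            break

        def net_gain(r):
            return fixes[r] - good[(r[0], r[2])]

        best = max(fixes, key=net_gain)
        if net_gain(best) <= 0:
            break
        rules.append(best)
        current = [apply_rule(tags, best) for tags in current]
    return base, rules

def tag(words, base, rules):
    """Tag with the baseline, then apply the learned rules in order."""
    tags = [base(w) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Hypothetical toy corpus (illustrative only).
TRAIN = [
    [("the", "DT"), ("race", "NN")],
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]

base, rules = learn_rules(TRAIN)
print(rules)
print(tag(["to", "race"], base, rules))
```

In this toy setting the baseline tags "race" as a noun everywhere, and the learner recovers one contextual rule that retags it as a verb after "to". A full tagger would use many more rule templates, including lexical ones, which matter for a morphologically rich language such as Urdu.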
Original language | English |
---|---|
Pages (from-to) | 437-448 |
Number of pages | 12 |
Journal | World Applied Sciences Journal |
Volume | 16 |
Issue number | 3 |
Publication status | Published - 01 Mar 2012 |
Externally published | Yes |
Keywords
- Statistical models
- Transformation-Based Learning
- Urdu Language
ASJC Scopus subject areas
- General