Urdu part of speech tagging using transformation based error driven learning

Fareena Naz*, Waqas Anwar, Usama Ijaz Bajwa, Ehsan Ullah Munir

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

10 Citations (Scopus)

Abstract

This paper presents preliminary results of applying Brill's Transformation-Based Learning (TBL) approach to the part-of-speech disambiguation problem in Urdu. In recent years a great deal of work has been done on European and South Asian languages, but comparatively little effort has been devoted to Urdu. With this in mind, the study presents a Part of Speech (POS) tagger for Urdu built with a data-driven approach, Brill's Transformation-Based Learning. The method automatically deduces rules from a training corpus with accuracy comparable to other statistical techniques, and it offers significant advantages over other tagging approaches. The POS tagger is trained on an Urdu corpus; in contrast to English, Urdu is a free word order language with inflectional characteristics and complex morphology. The corpus consists of 123,775 tokens and a tagset of 36 tags. The proposed POS tagger achieved an accuracy of around 84%. Precision, recall and F-measure have been calculated for the complete test corpus. An error analysis (confusion matrix) for the most frequently confused tag pairs is also presented, along with a brief overview of the Urdu language and Urdu tagging examples that illustrate the model. The performance of the proposed tagger has been compared with that of an N-gram POS tagger, and the transformation-based method outperforms the N-gram based tagger.
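
As a rough illustration of how transformation-based learning deduces rules from a training corpus, the following is a minimal Python sketch and not the authors' implementation. It assumes a single rule template ("change tag A to B when the previous tag is Z"), a most-frequent-tag baseline, and a tiny hypothetical English training set; the function names `baseline_tagger`, `apply_rule`, `learn_rules` and `tag` are invented for this example.

```python
from collections import Counter, defaultdict


def baseline_tagger(train):
    """Initial-state tagger: each word gets its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the most frequent tag overall (an assumption).
    default = Counter(t for s in train for _, t in s).most_common(1)[0][0]
    return lambda word: lexicon.get(word, default)


def apply_rule(tags, rule):
    """Rule (a, b, prev): retag a as b wherever the preceding tag is prev."""
    a, b, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == prev:
            out[i] = b
    return out


def count_errors(tagged, gold):
    return sum(t != g for ts, gs in zip(tagged, gold) for t, g in zip(ts, gs))


def learn_rules(train, max_rules=10):
    """Greedily pick the rule with the largest net error reduction each round."""
    tag_word = baseline_tagger(train)
    gold = [[t for _, t in s] for s in train]
    current = [[tag_word(w) for w, _ in s] for s in train]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at positions that are currently wrong.
        candidates = {
            (cur[i], ref[i], cur[i - 1])
            for cur, ref in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != ref[i]
        }
        base = count_errors(current, gold)
        best, best_gain = None, 0
        for rule in sorted(candidates):
            retagged = [apply_rule(ts, rule) for ts in current]
            gain = base - count_errors(retagged, gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no rule reduces the error count any further
            break
        rules.append(best)
        current = [apply_rule(ts, best) for ts in current]
    return tag_word, rules


def tag(words, tag_word, rules):
    """Tag new text: baseline tags first, then learned rules in order."""
    tags = [tag_word(w) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Hypothetical toy corpus: "race" is mostly a noun, but a verb after "to".
train = [
    [("the", "DT"), ("race", "NN")],
    [("a", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]
tag_word, rules = learn_rules(train)
print(rules)                                   # [('NN', 'VB', 'TO')]
print(tag(["to", "race"], tag_word, rules))    # ['TO', 'VB']
```

The learned rules are applied in the order in which they were found, so the tagger's behaviour stays inspectable as an ordered list of transformations. A full Brill-style tagger would use many more rule templates (surrounding words, following tags, affixes) and a score threshold for stopping.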

Original language: English
Pages (from-to): 437-448
Number of pages: 12
Journal: World Applied Sciences Journal
Volume: 16
Issue number: 3
Publication status: Published - 01 Mar 2012
Externally published: Yes

Keywords

  • Statistical models
  • Transformation-Based Learning
  • Urdu Language

ASJC Scopus subject areas

  • General
