Unsupervised Separation of Transliterable and Native Words for Malayalam

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in the Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on probability distributions over character n-grams that are refined in step with the nativeness scores in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.
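The abstract does not give DTIM's objective function or its word-stem diversity signal, so the following is only a generic sketch of the idea it names: two character n-gram distributions (one "native", one "transliterable") re-estimated in step with per-word nativeness scores. The function names, the bigram default, the smoothing constant, and the logistic update rule are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
import math


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def weighted_distribution(words, weights, n=2, smoothing=0.1):
    """Estimate a smoothed n-gram distribution, weighting each word's counts."""
    vocab = {g for w in words for g in char_ngrams(w, n)}
    counts = defaultdict(float)
    for word, weight in zip(words, weights):
        for gram in char_ngrams(word, n):
            counts[gram] += weight
    total = sum(counts.values()) + smoothing * len(vocab)
    return {g: (counts[g] + smoothing) / total for g in vocab}


def refine_scores(words, initial_scores, n=2, iterations=10):
    """Alternate between re-estimating both distributions and re-scoring words.

    Assumed update rule (not DTIM's): a word's new score is the logistic
    function of its mean per-n-gram log-likelihood ratio.
    """
    scores = list(initial_scores)
    for _ in range(iterations):
        native = weighted_distribution(words, scores, n)
        translit = weighted_distribution(words, [1.0 - s for s in scores], n)
        updated = []
        for word in words:
            grams = char_ngrams(word, n)
            llr = sum(math.log(native[g]) - math.log(translit[g]) for g in grams)
            updated.append(1.0 / (1.0 + math.exp(-llr / len(grams))))
        scores = updated
    return scores


# Placeholder strings and hand-picked starting scores, purely illustrative.
toy_words = ["abab", "abba", "baba", "xyxy", "yxxy"]
toy_initial = [0.8, 0.7, 0.6, 0.3, 0.2]
toy_scores = refine_scores(toy_words, toy_initial)
```

In this sketch the starting scores have to come from somewhere else. If every word starts at 0.5, the two distributions are identical and the scores never move, which is why an unsupervised method needs some prior signal to initialize them (the abstract points to the diversity of characters beyond the word stem).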
Language: English
Title of host publication: Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017)
Pages: 155-164
Number of pages: 10
Publication status: Published - 21 Dec 2017
Event: ICON 2017 - Kolkata, India
Duration: 18 Dec 2017 to 21 Dec 2017
https://ltrc.iiit.ac.in/icon2017/

Conference

Conference: ICON 2017
Country: India
City: Kolkata
Period: 18/12/2017 to 21/12/2017

Fingerprint

Text processing
Probability distributions

Cite this

Padmanabhan, D. (2017). Unsupervised Separation of Transliterable and Native Words for Malayalam. In Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017) (pp. 155-164).
Padmanabhan, Deepak. / Unsupervised Separation of Transliterable and Native Words for Malayalam. Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017). 2017. pp. 155-164
@inproceedings{3057c00812344b0b9deec684e03af1b9,
title = "Unsupervised Separation of Transliterable and Native Words for Malayalam",
author = "Deepak Padmanabhan",
year = "2017",
month = "12",
day = "21",
language = "English",
pages = "155--164",
booktitle = "Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017)",

}

Padmanabhan, D 2017, Unsupervised Separation of Transliterable and Native Words for Malayalam. in Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017). pp. 155-164, ICON 2017, Kolkata, India, 18/12/2017.

Unsupervised Separation of Transliterable and Native Words for Malayalam. / Padmanabhan, Deepak.

Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017). 2017. p. 155-164.

TY - GEN

T1 - Unsupervised Separation of Transliterable and Native Words for Malayalam

AU - Padmanabhan, Deepak

PY - 2017/12/21

Y1 - 2017/12/21

M3 - Conference contribution

SP - 155

EP - 164

BT - Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017)

ER -

Padmanabhan D. Unsupervised Separation of Transliterable and Native Words for Malayalam. In Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017). 2017. p. 155-164