BERT based language identification in code-mixed English-Assamese social media text

  • Nayan Jyoti Kalita*
  • Pritam Deka
  • Vijay Chennareddy
  • Shikhar Kumar Sarma

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Language identification in code-mixed language pairs has attracted growing research interest in recent years. With the extensive use of social media, identifying the languages in code-mixed text has become necessary for tasks such as detecting hate speech, misinformation, and disinformation. Recent transformer models such as BERT have shown very good results in many NLP tasks, including language identification. This work takes a transfer learning approach, applying a BERT model to word-level language identification in the code-mixed Assamese-English language pair. Experiments on an available data set show that BERT, with an accuracy of 94%, performs better than approaches using word-level features or semantic word embeddings.

Original language: English
Title of host publication: Machine Intelligence and Data Science Applications (MIDAS 2022): Proceedings
Editors: Amar Ramdane-Cherif, T. P. Singh, Ravi Tomar, Tanupriya Choudhury, Jung-Sup Um
Publisher: Springer Singapore
Pages: 173-181
Number of pages: 9
ISBN (Electronic): 9789819916207
ISBN (Print): 9789819916191, 9789819916221
Publication status: Published - 02 Sept 2023
Event: 3rd International Conference on Machine Intelligence & Data Science Applications (Midas - 2022) - Paris, France
Duration: 28 Oct 2022 → 29 Oct 2022

Publication series

Name: Algorithms for Intelligent Systems: MIDAS: Workshop on Mining Data for Financial Applications
Publisher: Springer Singapore
ISSN (Print): 2524-7565
ISSN (Electronic): 2524-7573

Conference

Conference: 3rd International Conference on Machine Intelligence & Data Science Applications (Midas - 2022)
Country/Territory: France
City: Paris
Period: 28/10/2022 → 29/10/2022
