Neural machine translation of low-resource languages using SMT phrase pair injection

Sukanta Sen*, Mohammed Hasanuzzaman, Asif Ekbal, Pushpak Bhattacharyya, Andy Way

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

25 Citations (Scopus)

Abstract

Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires high-quality large-scale parallel corpus, and it is not always possible to have sufficiently large corpus as it requires time, money, and professionals. Hence, many existing large-scale parallel corpus are limited to the specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in low-resource scenario without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi-English, Hindi-Bengali datasets for Health, Tourism, and Judicial (only for Hindi-English) domains. We train our NMT models for 10 translation directions, each using only 5-23k parallel sentences. Experiments show the improvements in the range of 1.38-15.36 BiLingual Evaluation Understudy points over the baseline systems. Experiments show that transformer models perform better than GRU models in low-resource scenarios. In addition to that, we also find that our proposed method outperforms SMT-which is known to work better than the neural models in low-resource scenarios-for some translation directions. In order to further show the effectiveness of our proposed model, we also employ our approach to another interesting NMT task, for example, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.

Original languageEnglish
Pages (from-to)271-292
Number of pages22
JournalNatural Language Engineering
Volume27
Issue number3
Early online date17 Jun 2020
DOIs
Publication statusPublished - May 2021
Externally publishedYes

Bibliographical note

Publisher Copyright:
© The Author(s), 2020. Published by Cambridge University Press.

Keywords

  • Machine translation
  • Translation technology

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'Neural machine translation of low-resource languages using SMT phrase pair injection'. Together they form a unique fingerprint.

Cite this