Abstract
Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and a sufficiently large corpus is not always available because building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in low-resource scenarios without using any additional data. Our approach augments the original training data with parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on gated recurrent unit (GRU) and transformer networks. We choose the Hindi-English and Hindi-Bengali datasets for the Health, Tourism, and Judicial (Hindi-English only) domains. We train our NMT models for 10 translation directions, each using only 5-23k parallel sentences. Experiments show improvements in the range of 1.38-15.36 BiLingual Evaluation Understudy (BLEU) points over the baseline systems. Experiments also show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed approach, we also apply it to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.
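The augmentation idea described in the abstract can be illustrated with a minimal Python sketch. It assumes the phrase pairs have already been extracted from the training data by an SMT toolkit; the extraction step is not shown. The function name `augment_with_phrases`, the `min_tokens` filter, and the toy sentence pairs are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source text, target text)


def augment_with_phrases(
    sentences: List[Pair],
    phrases: List[Pair],
    min_tokens: int = 2,  # hypothetical filter, not from the paper
) -> List[Pair]:
    """Append phrase pairs to the sentence pairs, skipping short or duplicate entries."""
    seen = set(sentences)
    augmented = list(sentences)
    for src, tgt in phrases:
        pair = (src.strip(), tgt.strip())
        # Drop phrase pairs that are too short on either side.
        if len(pair[0].split()) < min_tokens or len(pair[1].split()) < min_tokens:
            continue
        # Drop pairs already present in the corpus or added earlier.
        if pair in seen:
            continue
        seen.add(pair)
        augmented.append(pair)
    return augmented


# Toy data invented for illustration only.
toy_sentences = [("thou art welcome here", "you are welcome here")]
toy_phrases = [
    ("thou art", "you are"),
    ("here", "here"),         # filtered out: fewer than min_tokens tokens
    ("thou art", "you are"),  # filtered out: duplicate
]
toy_augmented = augment_with_phrases(toy_sentences, toy_phrases)
```

In this sketch the NMT model would then be trained on `toy_augmented` instead of `toy_sentences`, so no data from outside the original corpus is introduced.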
Original language | English |
---|---|
Pages (from-to) | 271-292 |
Number of pages | 22 |
Journal | Natural Language Engineering |
Volume | 27 |
Issue number | 3 |
Early online date | 17 Jun 2020 |
Publication status | Published - May 2021 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © The Author(s), 2020. Published by Cambridge University Press.
Keywords
- Machine translation
- Translation technology
ASJC Scopus subject areas
- Software
- Language and Linguistics
- Linguistics and Language
- Artificial Intelligence