Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT
Yining Wang, Long Zhou, Jiajun Zhang, Chengqing Zong
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CL
First published: 2017/11/13
Abstract: Neural machine translation (NMT), a new approach to machine translation, has
been shown to outperform conventional statistical machine translation (SMT)
across a variety of language pairs. Translation is an open-vocabulary problem,
but most existing NMT systems operate with a fixed vocabulary, which leaves them
unable to translate rare words. This problem can be alleviated by using
different translation granularities, such as character, subword and hybrid
word-character. Translation involving Chinese is one of the most difficult
tasks in machine translation; however, to the best of our knowledge, no prior
work has explored which translation granularity is most
suitable for Chinese in NMT. In this paper, we conduct an extensive comparison
using Chinese-English NMT as a case study. Furthermore, we discuss the
advantages and disadvantages of various translation granularities in detail.
Our experiments show that the subword model performs best for Chinese-to-English
translation with a vocabulary that is not very large, while the hybrid word-character
model is most suitable for English-to-Chinese translation. Moreover,
experiments with different granularities show that the Hybrid_BPE method can achieve
the best result on the Chinese-to-English translation task.
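
To make the granularities named in the abstract concrete, the following is a minimal sketch of how one segmented sentence might be represented at the word, character, and hybrid word-character levels. It is an illustration only, not the authors' implementation: the toy vocabulary, the `<unk>` symbol, the `</w>` end-of-word marker, and the `<B>`/`<M>`/`<E>` position tags are all assumptions chosen for this example, and subword (BPE) segmentation is left out because it requires merge operations learned from a corpus.

```python
from typing import List, Set

UNK = "<unk>"  # assumed placeholder symbol for out-of-vocabulary words


def word_level(words: List[str], vocab: Set[str]) -> List[str]:
    """Fixed-vocabulary word model: every unknown word collapses to <unk>."""
    return [w if w in vocab else UNK for w in words]


def char_level(words: List[str]) -> List[str]:
    """Character model: split every word into characters.

    The "</w>" end-of-word marker is an assumption of this sketch so that
    word boundaries can be recovered after translation.
    """
    units: List[str] = []
    for w in words:
        units.extend(w)  # iterating a str yields its characters
        units.append("</w>")
    return units


def hybrid_word_char(words: List[str], vocab: Set[str]) -> List[str]:
    """Hybrid model: keep in-vocabulary words, spell out the rest.

    Unknown words are split into characters tagged by position
    (<B> first, <M> middle, <E> last); the tag names are hypothetical.
    """
    units: List[str] = []
    for w in words:
        if w in vocab:
            units.append(w)
            continue
        last = len(w) - 1
        for i, ch in enumerate(w):
            tag = "<B>" if i == 0 else "<E>" if i == last else "<M>"
            units.append(tag + ch)
    return units


# Toy data: an already word-segmented Chinese sentence and a tiny vocabulary.
sentence = ["我", "喜欢", "机器翻译"]
vocab = {"我", "喜欢"}

print(word_level(sentence, vocab))
print(char_level(sentence))
print(hybrid_word_char(sentence, vocab))
```

In this sketch the word-level model loses the rare word entirely, the character model keeps it at the cost of a much longer sequence, and the hybrid model lengthens only the rare word, which is the trade-off the abstract refers to.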