Addressing the Rare Word Problem in Neural Machine Translation
Luong, Thang; Sutskever, Ilya; Le, Quoc V.; Vinyals, Oriol; Zaremba, Wojciech
Association for Computational Linguistics, 2015
# Addressing the Rare Word Problem in Neural Machine Translation
## Introduction
* NMT (Neural Machine Translation) systems perform poorly on OOV (out-of-vocabulary) and other rare words, because they operate with a fixed, modest-sized vocabulary.
* The paper presents a word-alignment-based technique for translating such rare words.
* [Link to the paper](https://arxiv.org/abs/1410.8206)
## Technique
* Annotate the training corpus with information about which source words the OOV words in the target sentence correspond to.
* The NMT system learns to track the alignment of rare words between the source and target sentences and emits these alignments for test sentences.
* As a post-processing step, use a dictionary to translate each rare word from the source language into the target language, or copy it verbatim if it is not in the dictionary (a minimal sketch of this step follows below).
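
A minimal sketch of this post-processing step, assuming the model emits a generic *unk* token and that the emitted alignments have been decoded into a target-to-source index map; the function name, token spelling, and toy dictionary are illustrative, not the paper's code:

```python
# Illustrative source->target dictionary; the paper builds one from the
# word alignments of the training corpus.
TRANSLATION_DICT = {"ecotax": "écotaxe"}

def post_process(source_tokens, output_tokens, alignment):
    """Replace each <unk> in the NMT output using its aligned source word:
    translate it with the dictionary when possible, otherwise copy it verbatim."""
    result = []
    for j, token in enumerate(output_tokens):
        if token == "<unk>" and j in alignment:  # alignment: target index -> source index
            src_word = source_tokens[alignment[j]]
            result.append(TRANSLATION_DICT.get(src_word, src_word))
        else:
            result.append(token)
    return result

# Running example from the paper: "The ecotax portico in Pont-de-Buis".
src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
out = ["Le", "portique", "<unk>", "de", "<unk>"]
print(post_process(src, out, {2: 1, 4: 4}))
# -> ['Le', 'portique', 'écotaxe', 'de', 'Pont-de-Buis']
```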
## Annotating the Corpus
### Copy Model
* Annotate the OOV words in the source sentence with tokens *unk1*, *unk2*, ..., such that repeated occurrences of the same word get the same token.
* Each OOV word in the target sentence that is aligned to an OOV word in the source sentence is assigned the same token as its aligned source word.
* A target OOV word that has no alignment, or that is aligned to a known source word, is assigned the null token (see the sketch after this list).
* Pros
  * Very straightforward to implement.
* Cons
  * Cannot recover target OOV words whose aligned source word is in-vocabulary, since these are mapped to the null token.
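
A minimal sketch of the copy-model annotation, assuming word alignments come as a target-to-source index map; the function and token names are illustrative:

```python
def annotate_copy(source, target, vocab_src, vocab_tgt, alignment):
    """Copy model: number the source OOVs unk1, unk2, ... (repeated words
    reuse their token); a target OOV copies the token of its aligned
    source OOV and falls back to a null token otherwise."""
    unk_ids = {}  # OOV source word -> its unkN token
    annotated_src = []
    for word in source:
        if word in vocab_src:
            annotated_src.append(word)
        else:
            unk_ids.setdefault(word, f"unk{len(unk_ids) + 1}")
            annotated_src.append(unk_ids[word])

    annotated_tgt = []
    for j, word in enumerate(target):
        if word in vocab_tgt:
            annotated_tgt.append(word)
        else:
            i = alignment.get(j)  # aligned source position, if any
            src_word = source[i] if i is not None else None
            annotated_tgt.append(unk_ids.get(src_word, "unk_null"))
    return annotated_src, annotated_tgt

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
print(annotate_copy(src, tgt,
                    vocab_src={"The", "portico", "in"},
                    vocab_tgt={"Le", "portique", "de"},
                    alignment={2: 1, 4: 4}))
# -> (['The', 'unk1', 'portico', 'in', 'unk2'],
#     ['Le', 'portique', 'unk1', 'de', 'unk2'])
```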
### PosAll - Positional All Model
* All OOV words in the source language are assigned a single *unk* token.
* Every word in the target sentence is followed by a positional token *p_d*, where *d = j − i* records that the *j*-th target word is aligned to the *i*-th source word (see the sketch after this list).
* Target words that are unaligned, or whose alignment distance is too large (|*d*| > 7), receive the null token *p_n* instead.
* Pros
  * Captures the complete alignment between source and target sentences.
* Cons
  * It doubles the length of target sentences.
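
A minimal sketch of the PosAll annotation under the same assumptions; the ±7 distance limit follows the paper, while the token spellings are illustrative:

```python
def annotate_posall(source, target, vocab_src, vocab_tgt, alignment, max_d=7):
    """PosAll: all source OOVs become <unk>; every target word is followed
    by a positional token p_d with d = j - i for its aligned source word,
    or p_n when unaligned or the distance exceeds max_d."""
    annotated_src = [w if w in vocab_src else "<unk>" for w in source]
    annotated_tgt = []
    for j, word in enumerate(target):
        annotated_tgt.append(word if word in vocab_tgt else "<unk>")
        i = alignment.get(j)
        if i is not None and abs(j - i) <= max_d:
            annotated_tgt.append(f"p{j - i}")
        else:
            annotated_tgt.append("p_n")
    return annotated_src, annotated_tgt

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
# PosAll needs the alignment of every target word, not just the OOVs.
print(annotate_posall(src, tgt,
                      vocab_src={"The", "portico", "in"},
                      vocab_tgt={"Le", "portique", "de"},
                      alignment={0: 0, 1: 2, 2: 1, 3: 3, 4: 4}))
# -> (['The', '<unk>', 'portico', 'in', '<unk>'],
#     ['Le', 'p0', 'portique', 'p-1', '<unk>', 'p1', 'de', 'p0', '<unk>', 'p0'])
```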
### PosUnk - Positional Unknown Model
* All OOV words in the source language are assigned a single *unk* token.
* Each OOV word in the target sentence is assigned a token *unkpos_d*, where *d* gives the relative position of the target word with respect to its aligned source word; unaligned OOV words get a null token (see the sketch after this list).
* Pros
  * Faster than the PosAll model, since the target sentence length is unchanged.
* Cons
  * Does not capture the alignment of in-vocabulary words.
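
A minimal sketch of the PosUnk annotation under the same assumptions; the sign convention *d = j − i* and the token spellings are illustrative, while the ±7 limit follows the paper:

```python
def annotate_posunk(source, target, vocab_src, vocab_tgt, alignment, max_d=7):
    """PosUnk: all source OOVs become <unk>; a target OOV becomes
    unkpos_d with d = j - i relative to its aligned source word, or
    unkpos_null when unaligned or the distance exceeds max_d.
    In-vocabulary target words are untouched, so the length is unchanged."""
    annotated_src = [w if w in vocab_src else "<unk>" for w in source]
    annotated_tgt = []
    for j, word in enumerate(target):
        if word in vocab_tgt:
            annotated_tgt.append(word)
            continue
        i = alignment.get(j)
        if i is not None and abs(j - i) <= max_d:
            annotated_tgt.append(f"unkpos{j - i}")
        else:
            annotated_tgt.append("unkpos_null")
    return annotated_src, annotated_tgt

src = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
tgt = ["Le", "portique", "écotaxe", "de", "Pont-de-Buis"]
print(annotate_posunk(src, tgt,
                      vocab_src={"The", "portico", "in"},
                      vocab_tgt={"Le", "portique", "de"},
                      alignment={2: 1, 4: 4}))
# -> (['The', '<unk>', 'portico', 'in', '<unk>'],
#     ['Le', 'portique', 'unkpos1', 'de', 'unkpos0'])
```

At test time, an emitted *unkpos_d* token at target position *j* lets the post-processing step locate the aligned source word directly at position *j − d*.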
## Experiments
* Dataset
* Subset of the WMT'14 English-French dataset
* Alignment computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/)
* Used architecture from [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f).
## Results
* All three approaches improve the performance of existing NMT systems, with gains ordered PosUnk > PosAll > Copy.
* Ensembles benefit more than individual models, as an ensemble of NMT models is better at aligning the OOV words.
* Performance gains are larger when a smaller vocabulary is used.
* Rare word analysis shows that gains grow with the proportion of OOV words in the sentence.