First published: 2016/11/01 (8 years ago) Abstract: While neural machine translation (NMT) has made good progress in the past
two years, tens of millions of bilingual sentence pairs are needed for its
training. However, human labeling is very costly. To tackle this training data
bottleneck, we develop a dual-learning mechanism, which can enable an NMT
system to automatically learn from unlabeled data through a dual-learning game.
This mechanism is inspired by the following observation: any machine
translation task has a dual task, e.g., English-to-French translation (primal)
versus French-to-English translation (dual); the primal and dual tasks can form
a closed loop, and generate informative feedback signals to train the
translation models, even if without the involvement of a human labeler. In the
dual-learning mechanism, we use one agent to represent the model for the primal
task and the other agent to represent the model for the dual task, then ask
them to teach each other through a reinforcement learning process. Based on the
feedback signals generated during this process (e.g., the language-model
likelihood of the output of a model, and the reconstruction error of the
original sentence after the primal and dual translations), we can iteratively
update the two models until convergence (e.g., using the policy gradient
methods). We call the corresponding approach to neural machine translation
\emph{dual-NMT}. Experiments show that dual-NMT works very well on
English$\leftrightarrow$French translation; especially, by learning from
monolingual data (with 10% bilingual data for warm start), it achieves a
comparable accuracy to NMT trained from the full bilingual data for the
French-to-English translation task.
TLDR; The authors fine-tune an FR -> EN NMT model using an RL-based dual game. 1. Pick a French sentence from a monolingual corpus and translate it to EN. 2. Use an EN language model to compute a reward for the translation. 3. Translate the translation back into FR using an EN -> FR system. 4. Compute a reward based on the consistency between the original and reconstructed sentence. Training this architecture with policy gradients, the authors make efficient use of monolingual data and show that a system trained on only 10% of the parallel data and fine-tuned with monolingual data achieves BLEU scores comparable to a system trained on the full set of parallel data.
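The four steps above can be sketched as a single round of the dual game. This is a minimal toy sketch, not the paper's implementation: `dual_game_step`, the dictionary translators, and the constant language-model score are all hypothetical stand-ins, and the exact-match reconstruction reward replaces the log-probability used in the paper.

```python
def dual_game_step(sent_fr, translate_fr_en, translate_en_fr, lm_en, alpha=0.5):
    """One round of the dual game starting from a monolingual FR sentence.

    Returns a linear combination of the language-model reward (fluency of
    the intermediate EN translation) and the reconstruction reward.
    """
    # Step 1: primal model translates FR -> EN.
    sent_en = translate_fr_en(sent_fr)
    # Step 2: EN language model scores the fluency of the translation.
    r1 = lm_en(sent_en)
    # Step 3: dual model translates back EN -> FR.
    sent_fr_rec = translate_en_fr(sent_en)
    # Step 4: consistency reward (crude exact-match stand-in for log P(fr|en)).
    r2 = 1.0 if sent_fr_rec == sent_fr else 0.0
    # Total reward is a linear combination of the two signals.
    return alpha * r1 + (1 - alpha) * r2

# Toy usage with one-word dictionary "translators" and a constant LM score.
fr2en = {"bonjour": "hello"}
en2fr = {"hello": "bonjour"}
reward = dual_game_step(
    "bonjour",
    lambda s: fr2en.get(s, s),
    lambda s: en2fr.get(s, s),
    lambda s: 0.8,  # pretend LM likelihood of the EN output
)
# reward = 0.5 * 0.8 + 0.5 * 1.0 = 0.9
```

In the real system both rewards are log-probabilities from trained models, and the models are updated from the reward rather than just scored.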
### Key Points
- Making efficient use of monolingual data to improve NMT systems is a challenge
- Two-agent communication game: Agent A only knows language A and agent B only knows language B. A sends a message through a noisy translation channel; B receives the message, checks its correctness, and sends it back through another noisy translation channel. A checks whether the result is consistent with the original message. The translation channels are then improved based on this feedback.
- Pieces required: LanguageModel(A), LanguageModel(B), TranslationModel(A->B), TranslationModel(B->A). Monolingual Data.
- Total reward is a linear combination of: `r1 = LM(translated_message)`, `r2 = log(P(original_message | translated_message))`
- Samples are generated via beam search, and the reward averaged over the beam candidates is used to approximate the true policy gradient
- EN -> FR pretrained on 100% of parallel data: 29.92 to 32.06 BLEU
- EN -> FR pretrained on 10% of parallel data: 25.73 to 28.73 BLEU
- FR -> EN pretrained on 100% of parallel data: 27.49 to 29.78 BLEU
- FR -> EN pretrained on 10% of parallel data: 22.27 to 27.50 BLEU
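The beam-averaged gradient approximation mentioned above can be sketched as follows. This is a hypothetical REINFORCE-style illustration: in dual-NMT the gradients of the log-probabilities are parameter-shaped tensors, whereas plain floats are used here purely to show the averaging over the K beam candidates.

```python
def policy_gradient_estimate(rewards, log_prob_grads):
    """Estimate the policy gradient as the average of reward * grad(log p)
    over K sampled (beam-searched) translation candidates."""
    assert len(rewards) == len(log_prob_grads)
    k = len(rewards)
    return sum(r * g for r, g in zip(rewards, log_prob_grads)) / k

# Two beam candidates: the first got reward 1.0, the second 0.0.
est = policy_gradient_estimate([1.0, 0.0], [2.0, 3.0])
# est = (1.0 * 2.0 + 0.0 * 3.0) / 2 = 1.0
```

Averaging over a small beam instead of sampling from the full model distribution is a variance-reduction choice; it trades unbiasedness for more meaningful candidates.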
### Some Notes
- I think the idea is very interesting and we'll see a lot of related work coming out of this. It would be even more amazing if the architecture were trained from scratch using monolingual data only. Due to the high variance of RL methods this is probably quite hard to do, though.
- I think the key issue is that the rewards are quite noisy, as is the case with MT in general. Neither the language model nor the BLEU scores give good feedback on the "correctness" of a translation.
- I wonder why there is such a huge jump in BLEU scores for FR->EN on 10% of data, but not for EN->FR on the same amount of data.