Dual Learning for Machine Translation on ShortScience.org

3

[link] Summary by tqri 8 years ago

In this article, the authors provide a framework for training two translation models with large accessible monolingual corpus.

In traditional methods, machine translation models always require large parallel corpus to train a good quality model, which is expensive to acquire. However, the massive monolingual data is not fully utilized. The monolingual corpus are typically used in pretraining the NMT decoder rnn and augmenting initial parallel corpus through self-generated translations.

The authors embed machine translation task into a reinforcement learning framework, in which two agents act as two different native speakers respectively and know little about each other and then they learn to translate by trying to communicate with each other.

**The two speakers**, `A` and `B`, obviously know well about their corresponding language respectively, this situation is easily simulated by two well-trained language models for `A` and `B`. Then, speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they don't know each other, `B` is uncertain about what `A` truly means by saying $y$. However, `B` is capable of evaluate the degree of sensibility of $y$ from his own understanding. Next, `B` informs `A` his sensibility evaluation score and tries to recover what `A` truly means in `A`'s language, i.e. $x'$. And similarly, `A` can also evaluate the degree of sensibility of $x'$ from his own understanding.

In general, the very original idea that `A` tried to convey, is passed through a noisy channel to `B`, and then back to `A` through another noisy channel. The former noisy channel is a `A-B` translation model and the latter a `B-A` translation model in the framework.

Think about how the first American learnt Chinese in history and I think it is intuitively similar to the principle in this work.

How do they derive Eq.7? There appears to be multiple steps here that the paper does not show. (Eq.6 seems straight forward, though)

Your comment:

2

[link] Summary by Denny Britz 8 years ago

TLDR; The authors finetune an FR -> EN NMT model using a RL-based dual game. 1. Pick a French sentence from a monolingual corpus and translate it to EN. 2. Use an EN language model to get a reward for the translation 3. Translate the translation back into FR using an EN -> FR system. 4. Get a reward based on the consistency between original and reconstructed sentence. Training this architecture using Policy Gradient authors can make efficient use of monolingual data and show that a system trained on only 10% of parallel data and finetuned with monolingual data achieves comparable BLUE scores as a system trained on the full set of parallel data.

### Key Points

- Making efficient use of monolingual data to improve NMT systems is a challenge
- Two Agent communication game: Agent A only knows language A and agent B only knows language B. A send message through a noisy translation channel, B receives message, checks its correctness, and sends it back through another noisy translation channel. A checks if it is consistent with the original message. Translation channels are then improves based on the feedback.
- Pieces required: LanguageModel(A), LanguageModel(B), TranslationModel(A->B), TranslationModel(B->A). Monolingual Data.
- Total reward is linear combination of: `r1 = LM(translated_message)`, `r2 = log(P(original_message | translated_message)`
- Samples are based on beam search using the average value as the gradient approximation
- EN -> FR pretrained on 100% of parallel data: 29.92 to 32.06 BLEU
- EN -> FR pretrained on 10% of parallel data: 25.73 to 28.73 BLEU
- FR -> EN pretrained on 100% of parallel data: 27.49 to 29.78 BLEU
- FR -> EN pretrained on 10% of parallel data: 22.27 to 27.50 BLEU

### Some Notes

- I think the idea is very interesting and we'll see a lot related work coming out of this. It would be even more amazing if the architecture was trained from scratch using monolingual data only. Due the the high variance of RL methods this is probably quite hard to do though.
- I think the key issue is that the rewards are quite noisy, as is the case with MT in general. Neither the language model nor the BLEU scores gives good feedback for the "correctness" of a translation.
- I wonder why there is such a huge jump in BLEU scores for FR->EN on 10% of data, but not for EN->FR on the same amount of data.

Your comment:

2

[link] Summary by CodyWild 7 years ago

The problem setting of the paper is the desire to perform translation in a monolingual setting, where datasets exist of each language independently, but little or no paired sentence data (paired here meaning that you know you have the same sentence or text in both languages). The paper outlines the prior methods in this area as being, first, training a single-language language model (i.e. train a model to take in a sentence, and return how coherent of a sentence it is in a given language) and using that to supplement a machine translation system. The authors honestly don’t go into this much, so I can’t tell exactly what they mean by it. The second baseline they talk about is bootstrapping themselves additional training data, by training a model using a small amount of training data, then using that mediocre model to translate additional sentences, which they use as additional training data to train the mediocre model to a higher performance. It doesn’t seem like this should work, but I’ve seen this or similar approaches used in a few cases, and it typically does add benefit. But, the authors claim, they can do better. 

The core intuition of this paper is pretty simple, and will be familiar to anyone who read my summary of CycleGAN, lo these many weeks ago.  Their approach rests on the idea that, even if you can’t push translation models to be objectively correct in a paired sense, you can push translation models to be symmetric with one another, insofar as translating from language A to B (let’s say English to French), and then back from French to English, gets you something in English that looks like your original input. This forces the model to maintain an informative mapping, so that enough information about the English sentence is stored to allow it to be reconstructed.However, unconstrained, the model could just develop a 1:1 word mapping that gives you information about the English input, but doesn’t actually map to the translation in French. If you can additionally confirm that the translation into French looks like a coherent French sentence (which, recall, we can do with a language model trained on French independently), we can get closer to to generating a mapping that is hopefully more coherent. 

One interesting aspect of this paper is the fact that the model they describe is trained with reinforcement learning. Typically, reinforcement learning is used for scenarios where you don’t have direct visibility into how the actions you take impact your loss function. Compare this to a supervised network (where you can take the derivative of your loss with respect to the last layer, and backpropogate that back through to your inputs), or even a GAN, where you can take the derivative of the discriminator-created loss back through the input the discriminator, and through into the GAN that created it. This model treats the translation models that are learned as policies; that is, probability distributions over sets of words. It samples multiple A -> B translations using something called beam search, which, as it samples the sequence of words, samples several at each timestep, and then keeps that chain alive by continuing to sample along it. This helps the sequential translation not fall into a pit where it samples one highly probable word, but then, as you add more words, it doesn’t lead towards a good sentence. This ultimately results in taking multiple (k=12, in this case) samples from each translation distribution, and so the model uses as its objective the expected rewards over these samples, where the rewards is constructed as a combination of in-language coherence loss (scored by using the log likelihood of a trained single-language model) and reconstruction loss (scored by close the A -> B -> A’ is to the original A). 

My confusion about the use of reinforcement loss here mostly comes from the question of whether it just wasn’t possible to build an end to end model, where, like a GAN, you backpropogated through a constructed input, back to the model that constructed it (in this case, through both translator models). Is the issue just that a sequence of words is fundamentally discrete, in a way that images aren’t, and in a way that impedes backprop? That seems possible, but also, I think it’s the case that a typical encoder-decoder model, that outputs softmaxes over words, is able to be backpropogated through. Overall, it’s hard for me to tell if I’m missing something basic about why reinforcement learning is the obvious choice here, or if other more GAN-like approaches were an option, but that meme hadn’t spread into the literature yet, and RL was the more historically canonical choice.  

One other minor disappointing note: it looks like their results are based on a scenario that does use a small number of bilingual training pairs, as a way to pretrain the translation models to a reasonable, non-random starting point. It’s not clear whether this method would have worked with an actual cold start, i.e. a translation model that has no idea what it’s doing, and is using only this as signal. That said, they used a much smaller number of bilingual pairs than a true supervised method, and so even with a need for a warm start, a method like this could still give you leverage over language pairs where there exists some paired data, but not enough to build a full, sophisticated model on top of.

Your comment: