Reward Augmented Maximum Likelihood for Neural Structured Prediction
Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
First published: 2016/09/01
Abstract: A key problem in structured output prediction is direct optimization of the
task reward function that matters for test evaluation. This paper presents a
simple and computationally efficient approach to incorporate task reward into a
maximum likelihood framework. We establish a connection between the
log-likelihood and regularized expected reward objectives, showing that at a
zero temperature, they are approximately equivalent in the vicinity of the
optimal solution. We show that optimal regularized expected reward is achieved
when the conditional distribution of the outputs given the inputs is
proportional to their exponentiated (temperature adjusted) rewards. Based on
this observation, we optimize conditional log-probability of edited outputs
that are sampled proportionally to their scaled exponentiated reward. We apply
this framework to optimize edit distance in the output label space. Experiments
on speech recognition and machine translation for neural sequence to sequence
models show notable improvements over a maximum likelihood baseline by using
edit distance augmented maximum likelihood.
Assume you have both a ground-truth label and a reward function that allows some flexibility in accepting outputs, as is the case for Hamming distance and negative edit distance.
Assume the output space is discrete; it can be, for instance, a quantized $n$-dimensional space.
Assume that, given an input and a set of labels, computing the probability of the labels is $O(L)$, meaning that most of the computation is independent of the label. (I am not sure about this! You certainly do not want to sample from the network.)
Suppose that you want to maximize an approximate objective that trades off optimizing the reward against fitting the label, and that you control this tradeoff with a single parameter $\tau$ fixed beforehand. This parameter only directly affects the reward term.
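For reference, my reading of the objective (per the abstract: maximize the conditional log-probability of outputs sampled in proportion to their exponentiated, temperature-scaled reward), with $r$ the task reward, e.g. negative edit distance:

$$
\mathcal{L}_{\mathrm{RAML}}(\theta;\tau) \;=\; \sum_{(x,\,y^*)} \mathbb{E}_{\tilde{y}\sim q(\tilde{y}\mid y^*;\tau)}\!\left[-\log p_\theta(\tilde{y}\mid x)\right],
\qquad
q(\tilde{y}\mid y^*;\tau) \;\propto\; \exp\{r(\tilde{y},y^*)/\tau\}.
$$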
Then you can compute the loss by sampling a discrete reward value and then sampling from the labels that attain that reward.
Therefore, you can quickly approximate the loss and its gradient through sampling. Moreover, the samples are independent of the network's output; they are fixed given $\tau$ and the ground truth. This is in contrast to sampling from the network and sampling around the predicted label.
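A minimal sketch of what this looks like per training example; `sample_from_q` and `model.log_prob` are hypothetical stand-ins I am assuming, not the paper's actual interface:

```python
def raml_loss(model, x, y_star, tau, num_samples=4):
    """Monte Carlo estimate of the RML loss for one (x, y*) pair.

    sample_from_q(y_star, tau) draws an edited target from the
    exponentiated payoff distribution. It depends only on the ground
    truth and tau, never on the model, so the edited targets could be
    precomputed once before training.
    """
    total = 0.0
    for _ in range(num_samples):
        y_tilde = sample_from_q(y_star, tau)      # fixed given (y*, tau)
        total -= model.log_prob(y_tilde, x)       # ordinary ML step on y_tilde
    return total / num_samples
```

The contrast above is visible here: forming the targets never requires decoding or sampling from the model itself.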
#### 2.2 Sampling from the exponentiated payoff distribution
Note that $q$ should be defined such that its entropy is fixed given $\tau$. The paper notes that, given the ground-truth label, if the loss is the Hamming or edit distance (so the reward is its negation), the samples can be put into buckets of equal loss, since the loss is discrete. One possible family of distributions controlled by $\tau$ is then given by the normalized bucket counts reweighted by $\exp\{-e/\tau\}$, where $e$ is the distance of the bucket. By varying $\tau$, you can go from a uniform distribution ($\tau=\infty$) to a delta ($\tau=0$). Eq. 11 gives the simplified count for each bucket, used for normalization and for further sampling. As noted above that equation, edge cases such as repetitions are ignored in this count to keep the computation easy.
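To make the bucket weights concrete, here is a minimal sketch for the Hamming-distance case, where the per-bucket count is exact; `L`, `V`, and the function name are my own notation, and the edit-distance case would use the approximate counts of Eq. 11 instead:

```python
import math

def bucket_distribution(L, V, tau):
    """Distribution over Hamming distances e = 0..L under q.

    Bucket e contains exactly comb(L, e) * (V - 1)**e outputs, all with
    reward -e, so its total weight is count * exp(-e / tau).
    """
    weights = [math.comb(L, e) * (V - 1) ** e * math.exp(-e / tau)
               for e in range(L + 1)]
    Z = sum(weights)
    return [w / Z for w in weights]
```

As $\tau \to 0$ this collapses onto $e=0$ (the ground truth); as $\tau \to \infty$ it becomes proportional to the bucket sizes, i.e. uniform over all outputs.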
Given such a $q$, with entropy fixed by $\tau$, you can then approximate the RML loss by sampling from $q$: first select an edit distance, then sample from the corresponding bucket.
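And a rough sketch of the two-stage sampler itself (again the Hamming-distance version, reusing `bucket_distribution` from above and assuming integer tokens $0..V-1$; the signature with the extra vocabulary size `V` is my own). For edit distance you would additionally allow insertions and deletions, with the bucket sizes replaced by the approximate counts of Eq. 11:

```python
import random

def sample_from_q(y_star, V, tau):
    """Draw one edited output from q(. | y*, tau), Hamming-distance version.

    Stage 1: pick a distance e with probability count(e) * exp(-e / tau) / Z.
    Stage 2: pick uniformly inside that bucket, i.e. substitute e randomly
             chosen positions with random *different* tokens.
    """
    L = len(y_star)
    probs = bucket_distribution(L, V, tau)
    e = random.choices(range(L + 1), weights=probs, k=1)[0]

    y_tilde = list(y_star)
    for i in random.sample(range(L), e):
        y_tilde[i] = random.choice([t for t in range(V) if t != y_tilde[i]])
    return y_tilde
```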
#### Footnotes
I was not satisfied with the results. I didn't quite follow section 3.