Reward Augmented Maximum Likelihood for Neural Structured Prediction
Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
First published: 2016/09/01
Abstract: A key problem in structured output prediction is direct optimization of the
task reward function that matters for test evaluation. This paper presents a
simple and computationally efficient approach to incorporate task reward into a
maximum likelihood framework. We establish a connection between the
log-likelihood and regularized expected reward objectives, showing that at a
zero temperature, they are approximately equivalent in the vicinity of the
optimal solution. We show that optimal regularized expected reward is achieved
when the conditional distribution of the outputs given the inputs is
proportional to their exponentiated (temperature adjusted) rewards. Based on
this observation, we optimize conditional log-probability of edited outputs
that are sampled proportionally to their scaled exponentiated reward. We apply
this framework to optimize edit distance in the output label space. Experiments
on speech recognition and machine translation for neural sequence to sequence
models show notable improvements over a maximum likelihood baseline by using
edit distance augmented maximum likelihood.
Assume you have both a ground-truth label and a reward function that allows some flexibility in accepting outputs, as is the case for Hamming distance and negative edit distance.
Assume the output space is discrete; it can be, for instance, a quantized $n$-dimensional space.
Assume that, given an input and a set of labels, computing the probability of the labels is $O(L)$, meaning that most of the computation is independent of the label. (I am not sure about this! You certainly do not want to sample from the network.)
Suppose that you want to maximize an approximate objective that trades off optimizing the reward against fitting the label, and that you control this tradeoff with a single parameter $\tau$ fixed beforehand. This parameter only directly affects the reward term.
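For reference, my reading of the objective (per the abstract: maximize the conditional log-probability of outputs sampled in proportion to their exponentiated, temperature-scaled reward), with $r$ the task reward, e.g. negative edit distance:

$$
\mathcal{L}_{\mathrm{RAML}}(\theta;\tau) \;=\; \sum_{(x,\,y^*)} \mathbb{E}_{\tilde{y}\sim q(\tilde{y}\mid y^*;\tau)}\!\left[-\log p_\theta(\tilde{y}\mid x)\right],
\qquad
q(\tilde{y}\mid y^*;\tau) \;\propto\; \exp\{r(\tilde{y},y^*)/\tau\}.
$$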
Then you can compute the loss by sampling a discrete reward value and then sampling from the labels that attain that reward.
Therefore, you can quickly approximate the loss and its gradient through sampling. Moreover, the samples are independent of the network's output; they are fixed given $\tau$ and the ground truth. This is in contrast to sampling from the network and sampling around the predicted label.
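A minimal sketch of what this looks like per training example; `sample_from_q` and `model.log_prob` are hypothetical stand-ins I am assuming, not the paper's actual interface:

```python
def raml_loss(model, x, y_star, tau, num_samples=4):
    """Monte Carlo estimate of the RML loss for one (x, y*) pair.

    sample_from_q(y_star, tau) draws an edited target from the
    exponentiated payoff distribution. It depends only on the ground
    truth and tau, never on the model, so the edited targets could be
    precomputed once before training.
    """
    total = 0.0
    for _ in range(num_samples):
        y_tilde = sample_from_q(y_star, tau)      # fixed given (y*, tau)
        total -= model.log_prob(y_tilde, x)       # ordinary ML step on y_tilde
    return total / num_samples
```

The contrast above is visible here: forming the targets never requires decoding or sampling from the model itself.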
#### 2.2 Sampling from the exponentiated payoff distribution
Note that $q$ should be defined such that its entropy is fixed given $\tau$. The paper notes that, given the ground-truth label, if the loss is the Hamming or edit distance (so the reward is its negation), the samples can be put into buckets of equal loss, since the loss is discrete. One possible family of distributions controlled by $\tau$ is then given by the normalized bucket counts reweighted by $\exp\{-e/\tau\}$, where $e$ is the distance of the bucket. By varying $\tau$, you can go from a uniform distribution ($\tau=\infty$) to a delta ($\tau=0$). Eq. 11 gives the simplified count for each bucket, used for normalization and for further sampling. As noted above that equation, edge cases such as repetitions are ignored in this count to keep the computation easy.
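To make the bucket weights concrete, here is a minimal sketch for the Hamming-distance case, where the per-bucket count is exact; `L`, `V`, and the function name are my own notation, and the edit-distance case would use the approximate counts of Eq. 11 instead:

```python
import math

def bucket_distribution(L, V, tau):
    """Distribution over Hamming distances e = 0..L under q.

    Bucket e contains exactly comb(L, e) * (V - 1)**e outputs, all with
    reward -e, so its total weight is count * exp(-e / tau).
    """
    weights = [math.comb(L, e) * (V - 1) ** e * math.exp(-e / tau)
               for e in range(L + 1)]
    Z = sum(weights)
    return [w / Z for w in weights]
```

As $\tau \to 0$ this collapses onto $e=0$ (the ground truth); as $\tau \to \infty$ it becomes proportional to the bucket sizes, i.e. uniform over all outputs.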
Given such a $q$, with entropy fixed by $\tau$, you can then approximate the RML loss by sampling from $q$: first select an edit distance, then sample from the corresponding bucket.
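And a rough sketch of the two-stage sampler itself (again the Hamming-distance version, reusing `bucket_distribution` from above and assuming integer tokens $0..V-1$; the signature with the extra vocabulary size `V` is my own). For edit distance you would additionally allow insertions and deletions, with the bucket sizes replaced by the approximate counts of Eq. 11:

```python
import random

def sample_from_q(y_star, V, tau):
    """Draw one edited output from q(. | y*, tau), Hamming-distance version.

    Stage 1: pick a distance e with probability count(e) * exp(-e / tau) / Z.
    Stage 2: pick uniformly inside that bucket, i.e. substitute e randomly
             chosen positions with random *different* tokens.
    """
    L = len(y_star)
    probs = bucket_distribution(L, V, tau)
    e = random.choices(range(L + 1), weights=probs, k=1)[0]

    y_tilde = list(y_star)
    for i in random.sample(range(L), e):
        y_tilde[i] = random.choice([t for t in range(V) if t != y_tilde[i]])
    return y_tilde
```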
#### Footnotes
I was not satisfied with the results. I didn't quite follow section 3.