It's a shame that the authors weren't able to continue their series of [great][reinforce] [paper][attention] [titles][beams], although it looks like they thought about calling this paper **"Put Replacement In Your Basement"**. Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the **"unordered set estimator"**, which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of expectations of non-differentiable functions over structured discrete distributions (that sounds specific, but it covers a vast number of important problems; for example, see [the first paragraph of this paper][mcts] and, less important but fun, adversarial examples or GANs on sequences).
For a distribution $p_\theta(s)$, where $s$ takes values in some discrete set of states, this paper is concerned with how to compute the gradient of the expectation of a function $f$ over $p_\theta(s)$:
$$
\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right]
$$
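To make this concrete, here is a minimal sketch (my own toy setup, not from the paper) that computes this gradient exactly for a small softmax-parameterised categorical and compares it against the single-sample score-function (REINFORCE) estimator, which is the baseline all of the estimators below improve on:

```python
import numpy as np

# Toy setup (mine, for illustration): p_theta(s) = softmax(theta) over n
# states, and an arbitrary function f evaluated at each state.
n = 5
rng = np.random.default_rng(0)
theta = rng.normal(size=n)
f = rng.normal(size=n)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(theta)

# Exact gradient of E_{s ~ p_theta}[f(s)] = sum_s p_theta(s) f(s),
# using d p_i / d theta_j = p_i (delta_ij - p_j).
exact_grad = p * f - p * (p @ f)

# Single-sample score-function (REINFORCE) estimator, f(s) * grad log p_theta(s);
# its mean over many draws should match exact_grad.
def reinforce_sample():
    s = rng.choice(n, p=p)
    grad_log_p = -p.copy()
    grad_log_p[s] += 1.0          # one-hot(s) - p
    return f[s] * grad_log_p

estimate = np.mean([reinforce_sample() for _ in range(100_000)], axis=0)
print(np.round(exact_grad, 3), np.round(estimate, 3))  # should roughly agree
```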
To jump straight to the definition of the estimator, I first have to define the *leave-one-out ratio*. Let $S^k$ be an unordered set of $k$ elements sampled without replacement from $p_\theta$, and let $p(S^k)$ be the probability of drawing that set. The leave-one-out ratio is
$$
R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)}.
$$
Here $p^{D \backslash \{ s \}}$ is the distribution over sets sampled from the domain with $s$ removed, which is equivalent to conditioning on $s$ being the first element sampled. So, for a given element $s$, $R(S^k, s)$ is the probability of sampling the other elements of $S^k$ *given $s$*, divided by the probability of sampling $S^k$.
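For small $k$, both probabilities in this ratio can be computed by brute force, summing over the $k!$ orderings of the set. A sketch, continuing the toy example above (the helper names are mine, not the paper's):

```python
from itertools import permutations

def p_set_without_replacement(probs, subset):
    """Probability of ending up with the unordered set `subset` when drawing
    len(subset) elements without replacement from a categorical with `probs`.
    Brute force: sum over all orderings of the product of conditional probs."""
    total = 0.0
    for order in permutations(subset):
        prob, remaining = 1.0, 1.0
        for s in order:
            prob *= probs[s] / remaining
            remaining -= probs[s]
        total += prob
    return total

def leave_one_out_ratio(probs, S, s):
    """R(S^k, s) = p^{D\{s}}(S^k \ {s}) / p(S^k).  Here p^{D\{s}} is the
    without-replacement distribution on the domain with s removed, i.e. the
    remaining probabilities renormalised by (1 - probs[s])."""
    probs_minus_s = probs / (1.0 - probs[s])   # entry s is never used below
    S_minus_s = [t for t in S if t != s]
    return (p_set_without_replacement(probs_minus_s, S_minus_s)
            / p_set_without_replacement(probs, S))

# e.g., with p from the snippet above:
print(leave_one_out_ratio(p, [0, 2, 3], 2))
```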
Then the *unordered set estimator* is
$$
\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{S^k \sim p_\theta} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) f(s) \right] \\
= \mathbb{E}_{S^k \sim p_\theta} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right],
$$
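Continuing the same sketch, one draw of this estimator looks like the following: sample a set of $k$ distinct states (here by naive sequential sampling without replacement, which for a flat categorical gives the same distribution as stochastic beam search), then weight each element's term by its leave-one-out ratio. Averaged over many draws, it should agree with `exact_grad` from the first snippet:

```python
def sample_without_replacement(rng, probs, k):
    """Draw k distinct states by sampling one at a time and renormalising."""
    probs = np.array(probs, dtype=float)
    out = []
    for _ in range(k):
        s = int(rng.choice(len(probs), p=probs / probs.sum()))
        out.append(s)
        probs[s] = 0.0
    return out

def grad_p(probs, s):
    """Gradient of p_theta(s) w.r.t. theta when p_theta = softmax(theta)."""
    g = -probs[s] * probs
    g[s] += probs[s]
    return g

def unordered_set_estimate(rng, theta, f, k):
    """One draw of the (no-baseline) unordered set estimator:
    sum over s in S^k of grad p_theta(s) * R(S^k, s) * f(s)."""
    probs = softmax(theta)
    S = sample_without_replacement(rng, probs, k)
    return sum(grad_p(probs, s) * leave_one_out_ratio(probs, S, s) * f[s]
               for s in S)

estimate_k = np.mean([unordered_set_estimate(rng, theta, f, k=3)
                      for _ in range(20_000)], axis=0)
print(np.round(exact_grad, 3), np.round(estimate_k, 3))  # should roughly agree
```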
and they proceed to show that it is a Rao-Blackwellization of (and so guaranteed to have variance no higher than) the single-sample estimator, importance-weighted estimators, and [the RB discrete estimator][rb].
The authors also show that they can use the multiple samples to compute a built-in baseline and reduce the variance further. Defining the leave-one-out ratio on a distribution with one element already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the *second-order* leave-one-out ratio), the estimator becomes:
$$
\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] \\
= \mathbb{E}_{S^k \sim p_\theta} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s) - \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right]
$$
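Transcribing this into the same sketch, the only subtlety is the $s' = s$ term of the inner sum: with $s$ removed from both the domain and the set, the second-order ratio reduces to $1$, which is how I handle it below. This is my reading of the formula, not the authors' code:

```python
def second_order_loo_ratio(probs, S, s, s_prime):
    """R^{D\{s}}(S^k, s'): the leave-one-out ratio for s', computed with s
    removed from both the domain (renormalising the rest by 1 - probs[s])
    and the sampled set.  For s' == s the ratio reduces to 1."""
    if s_prime == s:
        return 1.0
    probs_minus_s = probs / (1.0 - probs[s])
    S_minus_s = [t for t in S if t != s]
    return leave_one_out_ratio(probs_minus_s, S_minus_s, s_prime)

def unordered_set_estimate_with_baseline(rng, theta, f, k):
    """One draw of the unordered set estimator with its built-in baseline,
    transcribed directly from the displayed formula (my reading of it)."""
    probs = softmax(theta)
    S = sample_without_replacement(rng, probs, k)
    grad = np.zeros_like(theta)
    for s in S:
        baseline = sum(probs[s2] * second_order_loo_ratio(probs, S, s, s2) * f[s2]
                       for s2 in S)
        grad += grad_p(probs, s) * leave_one_out_ratio(probs, S, s) * (f[s] - baseline)
    return grad

estimate_kb = np.mean([unordered_set_estimate_with_baseline(rng, theta, f, k=3)
                       for _ in range(20_000)], axis=0)
print(np.round(exact_grad, 3), np.round(estimate_kb, 3))  # should roughly agree
```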
Thanks to [stochastic beam search][beams], these sets can be sampled quickly and easily, so it all adds up to a powerful, generic method. If I need gradients of an expectation over discrete variables in an SGD setting, and I can take multiple samples, this is the most promising method I know about at the moment.
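For the flat categorical in the sketch above, stochastic beam search reduces to the Gumbel top-$k$ trick: perturb every log-probability with independent Gumbel noise and keep the $k$ largest, which gives exactly a sample without replacement. For sequence models, the point of [stochastic beam search][beams] is that it does this without enumerating the domain.

```python
def gumbel_top_k(rng, log_probs, k):
    """Sample k distinct states without replacement via the Gumbel top-k trick:
    perturb each log-probability with Gumbel(0, 1) noise and take the k largest.
    For a flat categorical this is what stochastic beam search computes."""
    gumbels = rng.gumbel(size=len(log_probs))
    return list(np.argsort(-(log_probs + gumbels))[:k])

# e.g., with p from the snippets above:
S = gumbel_top_k(rng, np.log(p), k=3)
```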
[reinforce]: https://openreview.net/forum?id=r1lgTGL5DE
[attention]: https://arxiv.org/abs/1803.08475
[beams]: https://arxiv.org/abs/1903.06059
[mcts]: https://arxiv.org/abs/1910.06862
[rb]: https://arxiv.org/abs/1810.04777