Summary by Gavin Gray
This paper directly relies on understanding [Gumbel-Softmax][gs] ([Concrete][]) sampling.
The place to start thinking about this paper is in terms of a distribution over possible permutations. You could, for example, have a categorical distribution over all the permutation matrices of a given size. Of size 2 this is easy:
```
p(M):    0.1        0.9

M:      1, 0       0, 1
        0, 1       1, 0
```
You could apply the Gumbel-Softmax trick to this, and to other small sets of permutation matrices, but the number of possible permutations grows factorially with the dimension. If you want to infer the order of $100$ items, you now have a categorical over $100!$ outcomes, and you can't even store that number in single-precision floating point.
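To make that concrete, here is a quick check (plain Python/NumPy, nothing from the paper) of just how large $100!$ is:
```
import math
import numpy as np

n_perms = math.factorial(100)      # exact integer: ~9.33e157
print(len(str(n_perms)))           # 158 digits

# It still fits in a float64 (max ~1.8e308)...
print(float(n_perms))              # 9.332621544394415e+157
# ...but overflows single precision (max ~3.4e38):
print(np.float32(float(n_perms)))  # inf (with an overflow warning)
```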
By putting a normalised weight on each permutation matrix and summing, we are effectively running a [naive Birkhoff-von Neumann decomposition][naive] in reverse, which returns a sample from the space of doubly stochastic matrices. This paper proposes *much* more efficient ways to sample from this space that are amenable to stochastic variational (read *SGD*) methods; one of their experiments works with permutations of 278 variables.
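To see the doubly stochastic connection concretely, here is a quick check (plain NumPy, not from the paper) that a normalised weighting of permutation matrices lands in the Birkhoff polytope:
```
import numpy as np

# The two 2x2 permutation matrices and categorical weights from above
perms = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
p = [0.1, 0.9]

D = sum(w * P for w, P in zip(p, perms))
print(D)                              # [[0.1 0.9]
                                      #  [0.9 0.1]]
print(D.sum(axis=0), D.sum(axis=1))   # every row and column sums to 1
```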
There are two ingredients to a good stochastic variational recipe:
1. A differentiable sampling method; for example (most famously): $x = \mu + \sigma \epsilon \text{ where } \epsilon \sim \mathcal{N}(0,1)$
2. A density over this sampling distribution; in the preceding example: $\mathcal{N}(x; \mu, \sigma^2)$ (both ingredients are sketched in code just below)
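For the Gaussian example, both ingredients look like this (a minimal NumPy sketch; `mu` and `sigma` stand in for whatever variational parameters you want gradients for):
```
import numpy as np

rng = np.random.default_rng(0)

def sample(mu, sigma):
    """Ingredient 1: reparameterised sampling, differentiable in mu and sigma."""
    eps = rng.standard_normal()
    return mu + sigma * eps

def log_density(x, mu, sigma):
    """Ingredient 2: log N(x; mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x = sample(mu=0.0, sigma=1.0)
print(x, log_density(x, mu=0.0, sigma=1.0))
```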
This paper presents two recipes: Stick-breaking transformations and Rounding towards permutation matrices.
Stick-breaking
--------------------
[Shakir Mohamed wrote a good blog post on stick-breaking methods.][sticks]
One problem, which you could guess from the categorical-over-permutations example above, is that we can't possibly store a probability for each permutation matrix ($N!$ is a lot of numbers to store). Stick-breaking gives us another way to represent these probabilities implicitly, through $B \in [0,1]^{(N-1)\times (N-1)}$. A row `x` gets recursively broken up like this:
```
      B_11    B_12*(1 - x[:1].sum())     1 - x[:2].sum()
x:  |-------|------------------------|---------------------|
      x[0]            x[1]                    x[2]
    <------------------------------------------------------>
                             1.0
```
For two dimensions, this becomes a little more complicated (and is actually a novel part of this paper) so I'll just refer you to the paper and say: *this is also possible in two dimensions*.
OK, so now to sample from the distribution over doubly stochastic matrices, you just need to sample $(N-1) \times (N-1)$ values in the range $[0,1]$. The authors sample Gaussians and pass them through a sigmoid; as a temperature parameter is lowered, the values get pushed towards 0 or 1 and, in the limit, the result is a permutation matrix.
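Here is a minimal sketch of one row of this (my own reading, not the paper's full matrix construction): a tempered sigmoid of a Gaussian gives the break proportions, and stick-breaking turns them into a point on the simplex:
```
import numpy as np

def stick_break_row(mu, sigma, temperature, rng):
    """Sample a length-N point on the simplex from N-1 Gaussians."""
    z = mu + sigma * rng.standard_normal(mu.shape)  # reparameterised Gaussian
    b = 1.0 / (1.0 + np.exp(-z / temperature))      # tempered sigmoid -> (0, 1)
    x = np.empty(b.size + 1)
    remaining = 1.0                                 # length of stick left to break
    for k, b_k in enumerate(b):
        x[k] = b_k * remaining
        remaining -= x[k]
    x[-1] = remaining
    return x

rng = np.random.default_rng(0)
x = stick_break_row(np.zeros(3), np.ones(3), temperature=0.1, rng=rng)
print(x, x.sum())  # sums to 1; low temperature pushes entries towards 0 or 1
```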
To get the density, the authors *appear to* automatically differentiate through the $N^2$ steps of the stick-breaking transformation (this would probably be annoying in high dimensions) to get the Jacobian, and then apply the change of variables formula.
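For reference, this is the standard change of variables formula: if $x = f(z)$ for an invertible, differentiable $f$ and $z \sim p_Z$, then

$$p_X(x) = p_Z(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1} \quad \text{with } z = f^{-1}(x).$$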
Rounding
-------------
This method is more idiosyncratic, so I'll just copy the steps straight from the paper:
> 1. Input $Z \in \mathbb{R}^{N \times N}$, $M \in \mathbb{R}_+^{N \times N}$, and $V \in \mathbb{R}_+^{N \times N}$;
> 2. Map $M \to \tilde{M}$, a point in the Birkhoff polytope, using the [Sinkhorn-Knopp algorithm][sk];
> 3. Set $\Psi = \tilde{M} + V \odot Z$, where $\odot$ denotes elementwise multiplication;
> 4. Find $\text{round}(\Psi)$, the nearest permutation matrix to $\Psi$, using the [Hungarian algorithm][ha];
> 5. Output $X = \tau \Psi + (1- \tau)\text{round}(\Psi)$.
So you can just schedule $\tau \to 0$ during training, and your distribution will move towards a distribution over permutation matrices.
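Here is a hedged sketch of the whole recipe in NumPy/SciPy (the helper names and `n_iters` are my own; `scipy.optimize.linear_sum_assignment` plays the role of the Hungarian algorithm):
```
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn_knopp(M, n_iters=50):
    """Scale a positive matrix towards the Birkhoff polytope by
    alternately normalising its rows and columns."""
    M = M.copy()
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

def round_to_permutation(Psi):
    """Nearest permutation matrix to Psi, i.e. the P maximising <P, Psi>;
    scipy's assignment solver minimises, hence the negation."""
    rows, cols = linear_sum_assignment(-Psi)
    P = np.zeros_like(Psi)
    P[rows, cols] = 1.0
    return P

def rounding_sample(M, V, tau, rng):
    Z = rng.standard_normal(M.shape)       # step 1: Gaussian noise
    M_tilde = sinkhorn_knopp(M)            # step 2: into the Birkhoff polytope
    Psi = M_tilde + V * Z                  # step 3: perturb elementwise
    P = round_to_permutation(Psi)          # step 4: round
    return tau * Psi + (1.0 - tau) * P     # step 5: interpolate

rng = np.random.default_rng(0)
N = 4
X = rounding_sample(M=rng.random((N, N)) + 0.1, V=0.01 * np.ones((N, N)),
                    tau=0.1, rng=rng)
print(X.round(2))  # close to a permutation matrix for small tau
```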
The big problem with this is that we *can't easily get the density*: step 4 is not differentiable. However, the authors argue the whole map is still piecewise linear, with one invertible affine piece for each value $\text{round}(\Psi)$ can take, so the density can still be evaluated by change of variables on each piece.
Results
----------
On a synthetic permutation matching problem, the rounding method gets a better match to the true posterior (the synthetic problem is small enough to enumerate the true posterior). It also performs better than competing methods on a real matching problem: matching the activity of recorded neurons in *C. elegans* to the locations of neurons in the known connectome.
[sticks]: http://blog.shakirm.com/2015/12/machine-learning-trick-of-the-day-6-tricks-with-sticks/
[naive]: https://en.wikipedia.org/wiki/Doubly_stochastic_matrix#Birkhoff_polytope_and_Birkhoff.E2.80.93von_Neumann_theorem
[gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16
[concrete]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.00712
[sk]: https://en.wikipedia.org/wiki/Sinkhorn%27s_theorem#Sinkhorn-Knopp_algorithm
[ha]: https://en.wikipedia.org/wiki/Hungarian_algorithm