[link]
Summary by Gavin Gray 7 years ago
The propagation rule used in this paper is the following:
$$
H^l = \sigma \left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{l-1} W^l \right)
$$
Where $\tilde{A}$ is the [adjacency matrix][adj] of the undirected graph (with self connections, so has a diagonal of 1s and is symmetric) and $H^l$ are the hidden activations at layer $l$. The $D$ matrices are performing row normalisation, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is [equivalent to][pygcn] (with $\tilde{A}$ as `mx`):
```
rowsum = np.array(mx.sum(1)) # sum along rows
r_inv = np.power(rowsum, -1).flatten() # 1./rowsum elementwise
r_mat_inv = sp.diags(r_inv) # cast to sparse diagonal matrix
mx = r_mat_inv.dot(mx) # sparse matrix multiply
```
The symmetric way to express this is part of the [symmetric normalized Laplacian][laplace]. This work, and the [other][hammond] [work][defferrard] on graph convolutional networks, is motivated by convolving a parametric filter over $x$. Convolution becomes easy if we can perform a *graph Fourier transform* (don't worry I don't understand this either), but that requires us having access to eigenvectors of the normalized graph Laplacian (which is expensive).
[Hammond's early paper][hammond] suggested getting around this problem by using Chebyshev polynomials for the approximation. This paper takes the approximation even further, using only *first order* Chebyshev polynomial, on the assumption that this will be fine because the modeling capacity can be picked up by the deep neural network.
That's how the propagation rule above is derived, but we don't really need to remember the details. In practice $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} = \hat{A}$ is calculated prior and using a graph convolutional network involves multiplying your activations by this sparse matrix at every hidden layer. If you're thinking in terms of a batch with $N$ examples and $D$ features, this multiplication happens *over the examples*, mixing datapoints together (according to the graph structure). If you want to think of this in an orthodox deep learning way, it's the following:
```
activations = F.linear(H_lm1, W_l) # (N,D)
activations = activations.permute(1,0) # (D,N)
activations = F.linear(activations, hat_A) # (D,N)
activations = activations.permute(1,0) # (N,D)
H_l = F.relu(activations)
```
**Won't this be really slow though, $\hat{A}$ is $(N,N)$!** Yes, if you implemented it that way it would be very slow. But, many deep learning frameworks support sparse matrix operations ([although maybe not the backward pass][sparse]). Using that, a graph convolutional layer can be implemented [quite easily][pygcnlayer].
**Wait a second, these are batches, not minibatches?** Yup, minibatches are future work.
**What are the experimental results, though?** There are experiments showing this works well for semi-supervised experiments on graphs, as advertised. Also, the many approximations to get the propagation rule at the top are justified by experiment.
**This summary is bad.** Fine, smarter people have written their own posts: [the author's][kipf], [Ferenc's][ferenc].
[adj]: https://en.wikipedia.org/wiki/Adjacency_matrix
[pygcn]: https://github.com/tkipf/pygcn/blob/master/pygcn/utils.py#L56-L63
[laplace]: https://en.wikipedia.org/wiki/Laplacian_matrix#Symmetric_normalized_Laplacian
[hammond]: https://arxiv.org/abs/0912.3848
[defferrard]: https://arxiv.org/abs/1606.09375
[sparse]: https://discuss.pytorch.org/t/does-pytorch-support-autograd-on-sparse-matrix/6156/7
[pygcnlayer]: https://github.com/tkipf/pygcn/blob/master/pygcn/layers.py#L35-L68
[kipf]: https://tkipf.github.io/graph-convolutional-networks/
[ferenc]: http://www.inference.vc/how-powerful-are-graph-convolutions-review-of-kipf-welling-2016-2/

more
less