#### Introduction
Most recent semantic segmentation algorithms rely (explicitly or implicitly) on FCNs. However, the large receptive fields and many pooling layers lead to low spatial resolution in the deep layers. On top of that, the lack of an explicit pixelwise grouping mechanism often produces spatially fragmented and inconsistent results. To address this, the authors propose Convolutional Random Walk Networks (RWNs), which diffuse the FCN potentials in a random walk fashion based on learned pixelwise affinities, enforcing spatial consistency of the segmentation. A key contribution is that the RWN needs only 131 additional parameters over the DeepLab architecture, yet outperforms DeepLab by 1.5% on the PASCAL SBD dataset.
##### __1. Review of random walks on graphs__
In graph theory, an undirected graph is defined as $G=(V,E)$, where $V$ and $E$ are the vertices and edges respectively. A random walk on a graph is characterized by the transition probabilities between vertices. Let $W$ be an $n \times n$ symmetric *affinity* matrix, where $W_{ij}$ encodes the similarity of nodes $i$ and $j$ (usually with Gaussian affinities). The random walk transition matrix $A$ is then defined as $A = D^{-1}W$, where $D$ is the $n \times n$ diagonal *degree* matrix with $D_{ii} = \sum_j W_{ij}$. Letting $y_t$ denote the node distribution at time $t$, the distribution after one step of the random walk is $y_{t+1}=Ay_{t}$. The random walk process can be iterated until convergence.
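As a quick illustration, here is a minimal NumPy sketch of one such step on a hypothetical 4-node graph (the affinity values below are made up for illustration):

```python
import numpy as np

# Toy illustration (not from the paper): one random walk step on a
# hypothetical 4-node graph with made-up symmetric affinities.
W = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.9],
              [0.0, 0.1, 0.9, 1.0]])

D = np.diag(W.sum(axis=1))          # diagonal degree matrix, D_ii = sum_j W_ij
A = np.linalg.inv(D) @ W            # transition matrix A = D^{-1} W; rows sum to 1

y = np.array([1.0, 0.0, 0.0, 0.0])  # initial node values y_0
y = A @ y                           # one step: y_1 = A y_0
# Each node's value is now the affinity-weighted average of its neighbours;
# the step can be iterated (y_{t+1} = A y_t) until convergence.
```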
##### __2. Overall architecture__
The overall architecture consists of 3 branches:
* semantic segmentation branch (which is FCN)
* pixel-level affinity branch (to learn affinities)
* random walk layer (diffuse FCN potentials based on learned affinities)
![RWN](http://i.imgur.com/au5PoY2.png)
##### __A) Semantic segmentation branch__
The authors employed the DeepLab-LargeFOV FCN architecture as the semantic segmentation branch. As a result, the resolution of the $fc8$ activations is 8 times lower than that of the original image. Let $f \in \mathbb{R}^{n \times n \times m}$ denote the $fc8$ activations, where $n$ refers to the height/width of the $fc8$ grid and $m$ denotes the feature dimension.
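For concreteness, a toy shape calculation under the 8x-downsampling assumption above (the input size below is hypothetical):

```python
H = 320     # hypothetical input height/width
n = H // 8  # fc8 grid size after 8x downsampling -> 40
m = 21      # e.g. 21 classes for PASCAL VOC
# fc8 activations f then have shape (n, n, m) = (40, 40, 21)
```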
##### __B) Pixelwise affinity branch__
Hand-crafted affinities usually take a Gaussian form, i.e. $\exp\left(-\frac{(x-y)^2}{\sigma^2}\right)$, where $x$ and $y$ are typically pixel intensities and $\sigma$ controls the smoothness. In this work, the authors argue that learned affinities work better than hand-crafted color affinities. Apart from RGB features, $conv1\texttt{_}1$ (64-dimensional) and $conv1\texttt{_}2$ (64-dimensional) activations are also employed to build the affinities. In particular, the 3 feature maps are first downsampled by a factor of 8 to match the resolution of $fc8$ and concatenated to form an $n \times n \times k$ tensor, where $k=131$ (since $3+64+64=131$). Then, the $L1$ pairwise distance is computed for __each__ dimension to form a __sparse__ matrix $F \in \mathbb{R}^{n^2 \times n^2 \times 131}$ (the sparsity is due to the fact that distances are computed only for pixel pairs within a radius $R$). A $1 \times 1 \times 1$ $conv$ layer is attached (the kernel is therefore 131-dimensional, accounting for the only additional learned parameters in this work), followed by an $\exp$ layer, forming a sparse affinity matrix $W \in \mathbb{R}^{n^2 \times n^2 \times 1}$. A Euclidean loss layer is attached to optimize against the ground-truth pixel affinities obtained from the semantic segmentation annotations.
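A rough sketch of how such an affinity matrix might be assembled (a dense matrix is used here for clarity, whereas the paper stores $W$ sparsely; the function name, the Euclidean neighbourhood test, and the default radius are assumptions, and `w` stands in for the 131 learned conv weights):

```python
import numpy as np

def affinity_matrix(feats, w, R=5):
    """feats: (n, n, 131) RGB + conv1_1 + conv1_2 features on the fc8 grid;
    w: the 131 learned 1x1-conv weights (expected to come out negative so
    that affinity decays with feature distance)."""
    n, _, k = feats.shape
    flat = feats.reshape(n * n, k)
    ys, xs = np.mgrid[0:n, 0:n]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    W = np.zeros((n * n, n * n))
    for i in range(n * n):
        near = np.linalg.norm(coords - coords[i], axis=1) <= R  # radius-R pairs only
        d = np.abs(flat[near] - flat[i])  # per-channel L1 distances, shape (., 131)
        W[i, near] = np.exp(d @ w)        # 1x1 conv across channels, then exp
    return W                              # n^2 x n^2, zero outside radius R
```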
##### __C) Random walk layer__
The random walk layer diffuses the $fc8$ potentials from the semantic segmentation branch using the learned pixelwise affinities $W$. First, the random walk transition matrix $A$ is computed by row-normalizing $W$. The diffused segmentation prediction is then $\hat{y}=A^tf$, which simulates $t$ random walk steps. The random walk layer is finally attached to a softmax layer (with cross-entropy loss), and the whole network is trained end-to-end.
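A minimal sketch of that diffusion step (shapes are assumptions; `t=1` reflects the single-step finding discussed below):

```python
import numpy as np

def random_walk_layer(W, f, t=1):
    """W: (n^2, n^2) learned affinities; f: (n^2, m) fc8 potentials."""
    A = W / W.sum(axis=1, keepdims=True)  # row-normalize: A = D^{-1} W
    y_hat = f
    for _ in range(t):                    # y_hat = A^t f
        y_hat = A @ y_hat
    return y_hat                          # diffused potentials, fed to the softmax
```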
##### __3. Discussion__
* Although the RWN improves the coarse predictions, post-processing such as DenseCRF or Graph Cuts is still required.
* The authors showed that the learned affinities are better than hand-crafted color affinities. This is probably due to the finding that $conv1\texttt{_}2$ features helped improve the prediction.
* The authors observed that a single random walk step is optimal.
* For the pixelwise affinity branch, only $conv1\texttt{_}1$, $conv1\texttt{_}2$ and RGB cues are used, as they have the same spatial dimensions as the original image. Intuitively, only low-level features are required to ensure that higher-level features (from later layers) won't diffuse across boundaries (which are encoded in the earlier layers).
#### Conclusion
The authors proposed an RWN that diffuses the higher-level (more abstract) features based on __learned__ pixelwise affinities (lower-level cues) in a random walk fashion.