#### Introduction

Most recent semantic segmentation algorithms rely (explicitly or implicitly) on FCNs. However, their large receptive fields and many pooling layers lead to low spatial resolution in the deep layers. On top of that, the lack of an explicit pixelwise grouping mechanism often produces spatially fragmented and inconsistent results. To address this, the authors proposed Convolutional Random Walk Networks (RWNs), which diffuse the FCN potentials in a random walk fashion based on learned pixelwise affinities in order to enforce spatial consistency of the segmentation. A key contribution is that the RWN needs only 131 more parameters than the DeepLab architecture, yet outperforms DeepLab by 1.5% on the Pascal SBD dataset.

##### __1. Review of random graph walks__

In graph theory, an undirected graph is defined as $G=(V,E)$, where $V$ and $E$ are the vertices and edges respectively. A random walk on a graph is characterized by the transition probabilities between vertices. Let $W$ be an $n \times n$ symmetric *affinity* matrix, where $W_{ij}$ encodes the similarity of nodes $i$ and $j$ (usually a Gaussian affinity). The random walk transition matrix $A$ is then defined as $A = D^{-1}W$, where $D$ is the $n \times n$ diagonal *degree* matrix. Let $y_t$ denote the distribution over nodes at time $t$; after one step of the random walk, the distribution is $y_{t+1}=Ay_{t}$. The process can be iterated until convergence.

##### __2. Overall architecture__

The overall architecture consists of 3 branches:

* a semantic segmentation branch (an FCN)
* a pixel-level affinity branch (which learns the affinities)
* a random walk layer (which diffuses the FCN potentials based on the learned affinities)

![RWN](http://i.imgur.com/au5PoY2.png)

##### __A) Semantic segmentation branch__

The authors employed the DeepLab-LargeFOV FCN architecture as the semantic segmentation branch. As a result, the resolution of the $fc8$ activations is 8 times lower than that of the original image. Let $f \in \mathbb{R}^{n \times n \times m}$ denote the $fc8$ activations, where $n$ refers to the height/width and $m$ to the feature dimension.

##### __B) Pixelwise affinity branch__

Hand-crafted affinities are usually Gaussian, i.e. $\exp\left(-\frac{(x-y)^2}{\sigma^2}\right)$, where $x$ and $y$ are typically pixel intensities and $\sigma$ controls the smoothness. In this work, the authors argued that learned affinities work better than hand-crafted color affinities. Apart from the RGB features, the $conv1\texttt{_}1$ (64-dimensional) and $conv1\texttt{_}2$ (64-dimensional) activations are also employed to build the affinities. In particular, the 3 feature maps are first downsampled by a factor of 8 to match the resolution of $fc8$ and concatenated to form an $n \times n \times k$ tensor, where $k=131$ (since $3+64+64=131$). Then the $L1$ pairwise distance is computed for __each__ dimension to form a __sparse__ matrix $F \in \mathbb{R}^{n^2 \times n^2 \times 131}$ (sparse because distances are only computed for pixel pairs within a radius $R$). A $1 \times 1 \times 1$ convolution spanning the 131 distance channels is attached (its 131 weights are the only additional learned parameters in this work), followed by an $\exp$ layer, yielding a sparse affinity matrix $W \in \mathbb{R}^{n^2 \times n^2 \times 1}$. A Euclidean loss layer is attached to optimize against the ground-truth pixel affinities derived from the semantic segmentation annotations.
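Since the affinity construction and the diffusion in the random walk layer (section C below) reduce to a few matrix operations, here is a minimal NumPy sketch of the forward computation. Everything in it (the function name, shapes, the dense $W$, the default radius) is an illustrative assumption of mine, not the authors' implementation:

```
import numpy as np

def rwn_diffusion(feats, f, w_kernel, R=3, t=1):
    """feats    : (n, n, 131) concatenated RGB/conv1_1/conv1_2 features at fc8 resolution
       f        : (n*n, m) fc8 class potentials, one row per pixel
       w_kernel : (131,) weights of the 1x1x1 conv, the only new learned parameters
    """
    n = feats.shape[0]
    X = feats.reshape(n * n, -1)                    # (n^2, 131) feature per pixel
    coords = np.indices((n, n)).reshape(2, -1).T    # (n^2, 2) pixel coordinates
    W = np.zeros((n * n, n * n))                    # stored sparsely in practice
    for p in range(n * n):
        for q in range(n * n):
            if np.abs(coords[p] - coords[q]).max() <= R:  # only pairs within radius R
                dist = np.abs(X[p] - X[q])          # per-dimension L1 distance (131,)
                # 1x1x1 conv over the 131 distance channels, then the exp layer;
                # learned weights should make affinity decay with feature distance
                W[p, q] = np.exp(w_kernel @ dist)
    A = W / W.sum(axis=1, keepdims=True)            # row-normalize: A = D^{-1} W
    y_hat = f
    for _ in range(t):                              # \hat{y} = A^t f
        y_hat = A @ y_hat
    return y_hat
```

The dense double loop is only to keep the sketch short; the sparsity within radius $R$ is what makes the real layer tractable.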
##### __C) Random walk layer__

The random walk layer diffuses the $fc8$ potentials from the semantic segmentation branch using the learned pixelwise affinities $W$. First, the random walk transition matrix $A$ is computed by row-normalizing $W$. The diffused segmentation prediction is then $\hat{y}=A^tf$, which simulates $t$ random walk steps. The random walk layer is followed by a softmax layer (with cross-entropy loss), and the whole network is trained end-to-end.

##### __3. Discussion__

* Although RWN improves the coarse predictions, post-processing such as Dense-CRF or Graph Cuts is still required.
* The authors showed that the learned affinities outperform hand-crafted color affinities. This is probably explained by their finding that the $conv1\texttt{_}2$ features help improve the prediction.
* The authors observed that a single random walk step is optimal.
* For the pixelwise affinity branch, only $conv1\texttt{_}1$, $conv1\texttt{_}2$ and RGB cues are used, since they share the spatial dimensions of the original image. Intuitively, only low-level features are needed to ensure that the higher-level features (from later layers) do not diffuse across boundaries, which are encoded in the earlier layers.

#### Conclusion

The authors proposed an RWN that diffuses the higher-level (more abstract) features based on __learned__ pixelwise affinities (lower-level cues) in a random walk fashion.
#### Introduction

This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. The authors use *spatial* LSTMs (SLSTMs) to summarize the causal neighborhood of each pixel in a hidden state, and these hidden vectors are then fed into a factorized *mixture of conditional Gaussian scale mixtures* (MCGSM) to predict the state of the corresponding pixel.

##### __1. Spatial long short-term memory (SLSTM)__

This is a straightforward extension of the multidimensional RNN, designed to capture long-range interactions. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ the intensity of the pixel at location ${ij}$. At each location $ij$, the LSTM unit performs the following operations:

$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij}$

$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$

$\begin{pmatrix} \mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r \\ \mathbf{f}_{ij}^c \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix} T_{\mathbf{A},\mathbf{b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix}$

where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are the memory and hidden units respectively. Note that there are 2 different forget gates, $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$, for the 2 preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ denotes the *causal neighborhood* obtained by applying a Markov assumption.

![ride_1](http://i.imgur.com/W8ugGvl.png)

As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through the feedforward connections, it is indirectly connected to a much larger region (red) via the recurrent connections.

##### __2. Factorized mixtures of conditional Gaussian scale mixtures__

Using the chain rule, a generative model can be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta})$. One way to increase the representational power of such a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying the shared parameters leads to a drastic increase in the number of parameters. Therefore, the authors applied 2 commonly used simplifying assumptions:

1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to a small (causal) neighborhood around $x_{ij}$
2. __Stationarity and shift invariance__: the same set of parameters $\mathbf{\theta}$ is used at every location ${ij}$, which corresponds to the recurrent structure of an RNN.

The hidden vector from the SLSTM can therefore be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} | \mathbf{x}_{<ij}) = p(x_{ij} | \mathbf{h}_{ij})$. The conditional distribution in an MCGSM is represented as a mixture of experts:

$p(x_{ij} | \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s | \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) \, p(x_{ij} | \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$

where the first and second terms correspond to the gates and experts respectively.
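For concreteness, the three SLSTM update equations can be written out as a single cell function. The following NumPy sketch is my own illustration under assumed shapes (hidden size `d`, a flattened causal neighborhood), not code from the paper's repository:

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x_lt, h_left, h_up, c_left, c_up, A, b):
    """One SLSTM update at location (i, j).

    x_lt   : flattened causal neighborhood x_{<ij}
    h_left : hidden state h_{i,j-1};  h_up : hidden state h_{i-1,j}
    c_left : memory c_{i,j-1};        c_up : memory c_{i-1,j}
    A, b   : parameters of the affine map T_{A,b}
    """
    d = h_left.size                           # number of hidden units
    z = A @ np.concatenate([x_lt, h_left, h_up]) + b  # T_{A,b}(...)
    g = np.tanh(z[0:d])                       # proposed content
    o = sigmoid(z[d:2*d])                     # output gate
    i = sigmoid(z[2*d:3*d])                   # input gate
    f_r = sigmoid(z[3*d:4*d])                 # forget gate for c_{i-1,j}
    f_c = sigmoid(z[4*d:5*d])                 # forget gate for c_{i,j-1}
    c = g * i + c_left * f_c + c_up * f_r     # memory update
    h = np.tanh(c * o)                        # hidden state
    return c, h
```

Scanning this cell over the image from the top-left corner yields the hidden vectors $\mathbf{h}_{ij}$ that are fed to the MCGSM.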
To further reduce the number of parameters, the authors proposed using a *factorized* MCGSM, which makes larger neighborhoods and more mixture components affordable. (*__Remarks__: I am not entirely sure about the exact training procedure of the MCGSM, but as far as I understand, it is first trained end-to-end with the SLSTM using SGD with momentum, and then finetuned with L-BFGS after each epoch while the SLSTM parameters are kept fixed.*)

* For training:

```
# momentum SGD over minibatches; f is the objective (negative log-likelihood)
# and df its gradients w.r.t. all parameters
for n in range(num_epochs):
    for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
        # compute objective and gradients for this minibatch
        f, df = f_df(params, b)
        loss.append(f / log(2.) / self.num_channels)  # track loss in bits per channel

        # momentum update of the SLSTM parameters
        for l in train_layers:
            for key in params['slstm'][l]:
                diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
                params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]

        # momentum update of the MCGSM parameters
        diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
        params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```

* Finetuning (part of the code):

```
# recompute the hidden states with the SLSTM parameters kept fixed ...
for l in range(self.num_layers):
    self.slstm[l] = SLSTM(
        num_rows=hiddens.shape[1],
        num_cols=hiddens.shape[2],
        num_channels=hiddens.shape[3],
        num_hiddens=self.num_hiddens,
        batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
        nonlinearity=self.nonlinearity,
        extended=self.extended,
        slstm=self.slstm[l],
        verbosity=self.verbosity)
    hiddens = self.slstm[l].forward(hiddens)

# ... then finetune only the MCGSM, with early stopping based on
# validation performance
return self.mcgsm.train(
    hiddens_train, outputs_train,
    hiddens_valid, outputs_valid,
    parameters={
        'verbosity': self.verbosity,
        'train_means': train_means,
        'max_iter': max_iter})
```
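As a closing illustration of the gates-and-experts factorization above, here is a toy NumPy sketch of evaluating the conditional density $p(x_{ij} | \mathbf{h}_{ij})$. It is for intuition only: the real MCGSM parameterization is richer, and every name and shape below is an assumption of mine:

```
import numpy as np

def mcgsm_density(x, h, gate_w, means_w, log_scales):
    """p(x | h) = sum_{c,s} p(c, s | h) * N(x; mu_c(h), exp(log_scales[c, s]))

    h          : (d,) SLSTM hidden vector for this pixel
    gate_w     : (C, S, d) gate parameters
    means_w    : (C, d) expert mean parameters
    log_scales : (C, S) per-component, per-scale log variances
    """
    logits = np.einsum('csd,d->cs', gate_w, h)   # gate activations
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                         # p(c, s | h), sums to 1
    mu = means_w @ h                             # (C,) expert means mu_c(h)
    var = np.exp(log_scales)                     # (C, S) variances across scales
    experts = np.exp(-(x - mu[:, None])**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float((gates * experts).sum())        # mixture density
```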