Summary by Liew Jun Hao 8 years ago
#### Introduction
This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. In this work, the authors use *spatial* LSTMs (SLSTMs) to capture image semantics in the form of hidden states; these hidden vectors are then fed into *factorized mixtures of conditional Gaussian scale mixtures* (MCGSMs) to predict the state of the corresponding pixels.
##### __1. Spatial long short-term memory (SLSTM)__
This is a straightforward extension of the multidimensional RNN designed to capture long-range interactions. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ the intensity of the pixel at location ${ij}$. At each location $ij$, the LSTM unit performs the following operations:
$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij} $
$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$
$\begin{pmatrix}
\mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r \\ \mathbf{f}_{ij}^c
\end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix} T_{\mathbf{A,b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix} $
where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are the memory and hidden units, respectively. Note that there are two different forget gates, $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$, for the two preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ denotes the *causal neighborhood* of $x_{ij}$, obtained by applying a Markov assumption.
![ride_1](http://i.imgur.com/W8ugGvl.png)
As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections.
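The SLSTM update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `slstm_step` and the layout of the affine map $T_{\mathbf{A,b}}$ (a single matrix `A` stacking the five gate blocks) are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x_lt, h_left, h_above, c_left, c_above, A, b):
    """One SLSTM update at location (i, j) -- illustrative sketch.

    x_lt:    flattened causal neighborhood x_{<ij}
    h_left:  hidden state h_{i,j-1};  h_above: h_{i-1,j}
    c_left:  memory state c_{i,j-1};  c_above: c_{i-1,j}
    A, b:    parameters of the affine map T_{A,b}; A has 5*d rows,
             one block per gate (g, o, i, f^r, f^c)
    """
    d = h_left.size
    z = A @ np.concatenate([x_lt, h_left, h_above]) + b
    g   = np.tanh(z[0:d])        # proposed memory update
    o   = sigmoid(z[d:2*d])      # output gate
    i   = sigmoid(z[2*d:3*d])    # input gate
    f_r = sigmoid(z[3*d:4*d])    # forget gate for the row state c_{i-1,j}
    f_c = sigmoid(z[4*d:5*d])    # forget gate for the column state c_{i,j-1}
    c = g * i + c_left * f_c + c_above * f_r
    h = np.tanh(c * o)
    return c, h
```

Sweeping `slstm_step` over the image in raster order (with zero states at the borders) yields the full SLSTM pass.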
##### __2. Factorized mixtures of conditional Gaussian scale mixtures__
Using the chain rule, a generative model can be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta})$. One way to improve the representational power of a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying the shared parameters leads to a drastic increase in the number of parameters. The authors therefore apply two commonly used assumptions:
1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to a small neighborhood around $x_{ij}$ (the *causal neighborhood*)
2. __Stationarity and shift invariance__: the same set of parameters $\mathbf{\theta}_{ij}$ is used at every location ${ij}$, which corresponds to the recurrent structure of an RNN.
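To make the Markov assumption concrete, here is a small sketch of extracting the causal neighborhood of a pixel: the pixels that precede $(i, j)$ in raster order within a local window. The function name, window shape, and zero-padding convention are my own illustrative choices, not details from the paper.

```python
import numpy as np

def causal_neighborhood(x, i, j, k=2):
    """Pixels preceding (i, j) in raster order, within a (k+1) x (2k+1) window.

    Markov assumption: only this local window, not the full history x_{<ij},
    conditions the prediction. Zero-padding stands in for out-of-image pixels.
    """
    padded = np.pad(x, k)  # zero-pad so border pixels get full windows
    # rows i..i+k and cols j..j+2k of `padded` cover rows i-k..i and
    # cols j-k..j+k of `x`
    window = padded[i:i + k + 1, j:j + 2*k + 1]
    flat = window.ravel()
    # pixel (i, j) sits at flattened index k*(2k+1) + k; keep only the
    # pixels strictly before it in raster order
    return flat[:k * (2*k + 1) + k]
```

For a pixel in the top-left corner the neighborhood is all zeros (padding), which matches the usual convention of conditioning border pixels on an empty history.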
The hidden vector from the SLSTM can therefore be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} | \textbf{x}_{<ij}) = p(x_{ij} | \textbf{h}_{ij})$.
The conditional distribution in the MCGSM is represented as a mixture of experts:
$p(x_{ij} | \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s | \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij} | \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$.
where the first and second terms correspond to the gates and experts, respectively. To further reduce the number of parameters, the authors propose a *factorized* MCGSM, which makes larger neighborhoods and more mixture components affordable. (*__Remarks__: I am not too sure about the exact training of the MCGSM, but as far as I understand, the MCGSM is first trained end-to-end with the SLSTM using SGD with momentum, and then finetuned using L-BFGS after each epoch while the SLSTM parameters are kept fixed.*)
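To illustrate the mixture-of-experts form of the conditional, here is a deliberately simplified sketch: a mixture of one-dimensional conditional Gaussians whose gates and means are linear in $\mathbf{h}_{ij}$. A real MCGSM uses Gaussian *scale* mixtures with shared covariance factors, so the parameterization below (`gate_W`, `means_W`, `log_vars`) is a stand-in of my own, not the paper's model.

```python
import numpy as np

def mixture_logdensity(x_ij, h_ij, gate_W, means_W, log_vars):
    """log p(x_ij | h_ij) for a K-component mixture of Gaussian experts.

    gate_W:   (K, d) weights mapping h_ij to mixture logits (the gates)
    means_W:  (K, d) weights mapping h_ij to each expert's mean
    log_vars: (K,)   per-component log variances
    """
    logits = gate_W @ h_ij
    log_gates = logits - np.logaddexp.reduce(logits)   # log softmax over components
    means = means_W @ h_ij
    var = np.exp(log_vars)
    # log N(x_ij; mean_k, var_k) for each expert k
    log_experts = -0.5 * (np.log(2 * np.pi) + log_vars + (x_ij - means) ** 2 / var)
    # marginalize over components in log space
    return np.logaddexp.reduce(log_gates + log_experts)
```

Summing this quantity over all pixels gives the log-likelihood that the training loop below optimizes (there reported in bits via division by `log(2.)`).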
* For training:
```
for n in range(num_epochs):
    for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
        # compute gradients
        f, df = f_df(params, b)
        loss.append(f / log(2.) / self.num_channels)
        # update SLSTM parameters
        for l in train_layers:
            for key in params['slstm'][l]:
                diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
                params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]
        # update MCGSM parameters
        diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
        params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```
* Finetuning (part of the code)
```
for l in range(self.num_layers):
    self.slstm[l] = SLSTM(
        num_rows=hiddens.shape[1],
        num_cols=hiddens.shape[2],
        num_channels=hiddens.shape[3],
        num_hiddens=self.num_hiddens,
        batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
        nonlinearity=self.nonlinearity,
        extended=self.extended,
        slstm=self.slstm[l],
        verbosity=self.verbosity)
    hiddens = self.slstm[l].forward(hiddens)

# finetune with early stopping based on validation performance
return self.mcgsm.train(
    hiddens_train, outputs_train,
    hiddens_valid, outputs_valid,
    parameters={
        'verbosity': self.verbosity,
        'train_means': train_means,
        'max_iter': max_iter})
```