Summary by Liew Jun Hao 8 years ago
#### Introduction
This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. In this work, the authors use *spatial* LSTMs (SLSTMs) to capture image semantics in the form of hidden states; these hidden vectors are then fed into *factorized mixtures of conditional Gaussian scale mixtures* (MCGSMs) to predict the state of the corresponding pixels.
##### __1. Spatial long short-term memory (SLSTM)__
This is a straightforward extension of the multidimensional RNN designed to capture long-range interactions. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ the intensity of the pixel at location ${ij}$. At each location $ij$, the LSTM unit performs the following operations:
$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij} $
$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$
$\begin{pmatrix}
\mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r \\ \mathbf{f}_{ij}^c
\end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix} T_{\mathbf{A,b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix} $
where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are the memory and hidden units, respectively. Note that there are two different forget gates, $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$, for the two preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ denotes the *causal neighborhood* of $x_{ij}$, obtained by applying a Markov assumption.
![ride_1](http://i.imgur.com/W8ugGvl.png)
As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections.
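The SLSTM update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `slstm_step` and the layout of the affine map $T_{\mathbf{A,b}}$ (a single matrix `A` stacking the five gate blocks) are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x_lt, h_left, h_above, c_left, c_above, A, b):
    """One SLSTM update at location (i, j) -- illustrative sketch.

    x_lt:    flattened causal neighborhood x_{<ij}
    h_left:  hidden state h_{i,j-1};  h_above: h_{i-1,j}
    c_left:  memory state c_{i,j-1};  c_above: c_{i-1,j}
    A, b:    parameters of the affine map T_{A,b}; A has 5*d rows,
             one block per gate (g, o, i, f^r, f^c)
    """
    d = h_left.size
    z = A @ np.concatenate([x_lt, h_left, h_above]) + b
    g   = np.tanh(z[0:d])        # proposed memory update
    o   = sigmoid(z[d:2*d])      # output gate
    i   = sigmoid(z[2*d:3*d])    # input gate
    f_r = sigmoid(z[3*d:4*d])    # forget gate for the row state c_{i-1,j}
    f_c = sigmoid(z[4*d:5*d])    # forget gate for the column state c_{i,j-1}
    c = g * i + c_left * f_c + c_above * f_r
    h = np.tanh(c * o)
    return c, h
```

Sweeping `slstm_step` over the image in raster order (with zero states at the borders) yields the full SLSTM pass.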
##### __2. Factorized mixtures of conditional Gaussian scale mixtures__
Using the chain rule, a generative model can be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta})$. One way to improve the representational power of a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying the shared parameters leads to a drastic increase in the number of parameters. The authors therefore apply two commonly used assumptions:
1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to a small neighborhood around $x_{ij}$ (the *causal neighborhood*)
2. __Stationarity and shift invariance__: the same set of parameters $\mathbf{\theta}_{ij}$ is used at every location ${ij}$, which corresponds to the recurrent structure of an RNN.
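To make the Markov assumption concrete, here is a small sketch of extracting the causal neighborhood of a pixel: the pixels that precede $(i, j)$ in raster order within a local window. The function name, window shape, and zero-padding convention are my own illustrative choices, not details from the paper.

```python
import numpy as np

def causal_neighborhood(x, i, j, k=2):
    """Pixels preceding (i, j) in raster order, within a (k+1) x (2k+1) window.

    Markov assumption: only this local window, not the full history x_{<ij},
    conditions the prediction. Zero-padding stands in for out-of-image pixels.
    """
    padded = np.pad(x, k)  # zero-pad so border pixels get full windows
    # rows i..i+k and cols j..j+2k of `padded` cover rows i-k..i and
    # cols j-k..j+k of `x`
    window = padded[i:i + k + 1, j:j + 2*k + 1]
    flat = window.ravel()
    # pixel (i, j) sits at flattened index k*(2k+1) + k; keep only the
    # pixels strictly before it in raster order
    return flat[:k * (2*k + 1) + k]
```

For a pixel in the top-left corner the neighborhood is all zeros (padding), which matches the usual convention of conditioning border pixels on an empty history.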
The hidden vector from the SLSTM can therefore be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} | \textbf{x}_{<ij}) = p(x_{ij} | \textbf{h}_{ij})$.
The conditional distribution in the MCGSM is represented as a mixture of experts:
$p(x_{ij} | \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s | \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij} | \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$.
where the first and second terms correspond to the gates and experts, respectively. To further reduce the number of parameters, the authors propose a *factorized* MCGSM, which makes larger neighborhoods and more mixture components affordable. (*__Remarks__: I am not too sure about the exact training of the MCGSM, but as far as I understand, the MCGSM is first trained end-to-end with the SLSTM using SGD with momentum, and then finetuned using L-BFGS after each epoch while the SLSTM parameters are kept fixed.*)
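To illustrate the mixture-of-experts form of the conditional, here is a deliberately simplified sketch: a mixture of one-dimensional conditional Gaussians whose gates and means are linear in $\mathbf{h}_{ij}$. A real MCGSM uses Gaussian *scale* mixtures with shared covariance factors, so the parameterization below (`gate_W`, `means_W`, `log_vars`) is a stand-in of my own, not the paper's model.

```python
import numpy as np

def mixture_logdensity(x_ij, h_ij, gate_W, means_W, log_vars):
    """log p(x_ij | h_ij) for a K-component mixture of Gaussian experts.

    gate_W:   (K, d) weights mapping h_ij to mixture logits (the gates)
    means_W:  (K, d) weights mapping h_ij to each expert's mean
    log_vars: (K,)   per-component log variances
    """
    logits = gate_W @ h_ij
    log_gates = logits - np.logaddexp.reduce(logits)   # log softmax over components
    means = means_W @ h_ij
    var = np.exp(log_vars)
    # log N(x_ij; mean_k, var_k) for each expert k
    log_experts = -0.5 * (np.log(2 * np.pi) + log_vars + (x_ij - means) ** 2 / var)
    # marginalize over components in log space
    return np.logaddexp.reduce(log_gates + log_experts)
```

Summing this quantity over all pixels gives the log-likelihood that the training loop below optimizes (there reported in bits via division by `log(2.)`).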
* For training:
```
for n in range(num_epochs):
    for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
        # compute gradients
        f, df = f_df(params, b)
        loss.append(f / log(2.) / self.num_channels)
        # update SLSTM parameters
        for l in train_layers:
            for key in params['slstm'][l]:
                diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
                params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]
        # update MCGSM parameters
        diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
        params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```
* Finetuning (part of the code)
```
for l in range(self.num_layers):
    self.slstm[l] = SLSTM(
        num_rows=hiddens.shape[1],
        num_cols=hiddens.shape[2],
        num_channels=hiddens.shape[3],
        num_hiddens=self.num_hiddens,
        batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
        nonlinearity=self.nonlinearity,
        extended=self.extended,
        slstm=self.slstm[l],
        verbosity=self.verbosity)
    hiddens = self.slstm[l].forward(hiddens)

# finetune with early stopping based on validation performance
return self.mcgsm.train(
    hiddens_train, outputs_train,
    hiddens_valid, outputs_valid,
    parameters={
        'verbosity': self.verbosity,
        'train_means': train_means,
        'max_iter': max_iter})
```