Welcome to ShortScience.org!
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make it easier to run on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks.

Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:

1. Weight rewinding, where you rewind weights to *near* their starting value and then retrain using the learning rates from early in training, outperforms fine-tuning from the values the weights had when you pruned, but also
2. Learning rate rewinding, where you keep the weights as they are but rewind learning rates to what they were early in training, is actually the most effective for a given amount of training time/search cost.

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
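As a rough illustration, here's a minimal PyTorch-style sketch of the difference between the standard fine-tuning baseline and learning rate rewinding after magnitude pruning. The `train_one_epoch` helper, the schedule handling, and the pruning heuristic below are my own stand-ins, not the paper's code.

```python
import torch

# Hypothetical helper: one epoch of training at a fixed learning rate,
# re-applying the pruning masks so pruned weights stay at zero.
def train_one_epoch(model, masks, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # ... run the usual forward / backward / opt.step() loop here ...
    for name, param in model.named_parameters():
        if name in masks:
            param.data.mul_(masks[name])

def magnitude_prune(model, sparsity):
    """Zero out the lowest-magnitude weights and return binary masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                       # skip biases / norm params
            continue
        k = max(1, int(sparsity * param.numel()))
        threshold = param.detach().abs().flatten().kthvalue(k).values
        masks[name] = (param.detach().abs() > threshold).float()
        param.data.mul_(masks[name])
    return masks

def finetune(model, masks, final_lr, epochs):
    # Standard practice: keep training at the final (small) learning rate.
    for _ in range(epochs):
        train_one_epoch(model, masks, lr=final_lr)

def lr_rewind(model, masks, lr_schedule, rewind_point):
    # Learning-rate rewinding: keep the pruned weights as they are,
    # but replay the learning-rate schedule from an earlier point in training.
    for lr in lr_schedule[rewind_point:]:
        train_one_epoch(model, masks, lr=lr)
```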
[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach to image generation produces images that can't be distinguished from the training data.

#### What is DRAW:

The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder.

#### What do we gain?

The resulting images are greatly improved by allowing conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the "Where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
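A very rough sketch of the sequential generation loop, as I understand it, with illustrative sizes and the attention-less read/write for brevity; the paper's spatial attention would replace the full-image read and write used here.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not the paper's exact configuration).
A = B = 28          # image height / width
z_dim, h_dim = 10, 256
T = 10              # number of generation steps

enc_rnn = nn.LSTMCell(2 * A * B + h_dim, h_dim)
dec_rnn = nn.LSTMCell(z_dim, h_dim)
to_mu = nn.Linear(h_dim, z_dim)
to_logvar = nn.Linear(h_dim, z_dim)
write = nn.Linear(h_dim, A * B)   # attention-less "write"; attention would paste a patch

def draw(x):
    n = x.size(0)
    canvas = torch.zeros(n, A * B)
    h_enc = c_enc = torch.zeros(n, h_dim)
    h_dec = c_dec = torch.zeros(n, h_dim)
    kl = 0.0
    for _ in range(T):
        x_hat = x - torch.sigmoid(canvas)                        # what's still missing
        r = torch.cat([x, x_hat], dim=1)                         # "read" (no attention)
        h_enc, c_enc = enc_rnn(torch.cat([r, h_dec], dim=1), (h_enc, c_enc))
        mu, logvar = to_mu(h_enc), to_logvar(h_enc)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        h_dec, c_dec = dec_rnn(z, (h_dec, c_dec))
        canvas = canvas + write(h_dec)                           # additive canvas update
        kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
    return torch.sigmoid(canvas), kl

x = torch.rand(4, A * B)           # a batch of flattened 28x28 "images"
recon, kl = draw(x)
print(recon.shape, kl.shape)       # torch.Size([4, 784]) torch.Size([4])
```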
[link]
#### Introduction

* Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

#### Global Matrix Factorization Methods

* Decompose large matrices into low-rank approximations.
* eg - Latent Semantic Analysis (LSA)

##### Limitations

* Poor performance on the word analogy task.
* Frequent words contribute disproportionately to the similarity measure.

#### Shallow, Local Context-Based Window Methods

* Learn word representations using adjacent words.
* eg - Continuous bag-of-words (CBOW) model and skip-gram model.

##### Limitations

* Since they do not operate directly on the global co-occurrence counts, they cannot utilise the statistics of the corpus effectively.

#### GloVe Model

* To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with other words $k$) instead of the raw probabilities themselves.
* In the most general form:
    * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* We want $F$ to encode the information in the vector space (which has a linear structure), so we can restrict it to the difference of $w_{i}$ and $w_{j}$:
    * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* Since the right-hand side is a scalar and the left-hand side involves vectors, we take the dot product of the arguments:
    * $F((w_{i} - w_{j})^{T}\tilde{w}_{k}) = P_{ik}/P_{jk}$
* $F$ should be invariant to the order of the word pair $i$ and $j$, which leads to:
    * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$
* Doing further simplifications and optimisations (refer to the paper), we get the cost function (a small code sketch of this objective appears at the end of this summary):
    * $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij} \right)^{2}$
* $f$ is a weighting function:
    * $f(x) = \min\left( (x/x_{\max})^{\alpha}, 1 \right)$
    * Typical values are $x_{\max} = 100$ and $\alpha = 3/4$.
* $b$ and $\tilde{b}$ are the bias terms.

##### Complexity

* Depends on the number of non-zero elements in the input matrix.
* Upper-bounded by the square of the vocabulary size.
* Since for shallow window-based approaches complexity depends on $|C|$ (the size of the corpus), tighter bounds are needed.
* By modelling the number of co-occurrences of words as a power-law function of frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$.

#### Evaluation

##### Tasks

* Word Analogies
    * a is to b as c is to ___?
    * Both semantic and syntactic pairs.
    * Find the closest $d$ to $w_{b} - w_{a} + w_{c}$ (using cosine similarity).
* Word Similarity
* Named Entity Recognition

##### Datasets

* Wikipedia dumps - 2010 and 2014
* Gigaword5
* Combination of Gigaword5 and Wikipedia2014
* CommonCrawl
* 400,000 most frequent words considered from the corpus.

##### Hyperparameters

* Size of the context window.
* Whether to distinguish left context from right context.
* Word pairs that are $d$ words apart contribute $1/d$ to the total count.
* $x_{\max} = 100$
* $\alpha = 3/4$
* AdaGrad updates.

##### Models Compared With

* Singular Value Decomposition
* Continuous Bag-Of-Words
* Skip-Gram

##### Results

* GloVe outperforms all other models significantly.
* Diminishing returns for vectors larger than 200 dimensions.
* Small and asymmetric context windows (context window only to the left) work better for syntactic tasks.
* Long and symmetric context windows (context window on both sides) work better for semantic tasks.
* The syntactic task benefited from a larger corpus, though the semantic task performed better with Wikipedia than with Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5.
* Word2vec's performance decreases if the number of negative samples increases beyond about 10.
* For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
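A minimal NumPy sketch of the weighted least-squares objective above, assuming a precomputed co-occurrence matrix `X`; the variable names and the random stand-in data are illustrative, not from the paper's code.

```python
import numpy as np

# Toy sizes and a stand-in co-occurrence matrix (counts).
V, d = 1000, 50
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)

W  = 0.01 * rng.standard_normal((V, d))   # word vectors w_i
Wt = 0.01 * rng.standard_normal((V, d))   # context vectors \tilde{w}_j
b  = np.zeros(V)                          # biases b_i
bt = np.zeros(V)                          # context biases \tilde{b}_j

def weight(x, x_max=100.0, alpha=0.75):
    # f(x) = min((x / x_max)^alpha, 1)
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, Wt, b, bt, X):
    i, j = np.nonzero(X)                  # the sum runs over non-zero counts only
    err = (W[i] * Wt[j]).sum(axis=1) + b[i] + bt[j] - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)

print(glove_loss(W, Wt, b, bt, X))
```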
[link]
Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more acute when it comes to labeling properties of molecules, since they typically require wet lab chemistry to measure empirically. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a new targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere.

There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of the model itself, which combines both a message-passing Graph Neural Network to aggregate local information and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is:

https://i.imgur.com/JY4vRdd.png

- There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and an edge embedding, for four embeddings total. I'll describe the one for nodes; the one for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes.
- In the NodeTransformer version, a message-passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" information from a few hops out into its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, so each node ends up with three different vectors for these three things.
- Multi-headed attention is then performed over these node representations in the normal way, where all keys and queries are dot-producted together and put into a softmax to calculate a weighted average over the values.
- We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put through an MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated at a full-graph level to predict graph-level properties. (A rough code sketch of the node-side stack appears at the end of this summary.)

https://i.imgur.com/NNl6v4Y.png

The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global.
The local task constructs labels based on local contextual properties of a given atom - for example, the atom here has one double-bonded nitrogen and one single-bonded oxygen in its local environment - and tries to predict those labels given the representation of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different motifs or functional groups in the molecule, and encodes those into an 85-long one-hot vector that is predicted at the graph level.

https://i.imgur.com/jzbYchA.png

With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large margins on all tasks, even relative to existing semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before a transformer (the alternative being just running a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly.
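Here is a rough, illustrative sketch of how I picture the node-side GNN + Transformer stack described above; the layer counts, sizes, and the simple mean-neighborhood message passing are my own stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class NodeGTransformer(nn.Module):
    """Sketch: message passing produces query/key/value node states,
    then multi-head self-attention mixes them globally."""
    def __init__(self, d=64, hops=2, heads=4):
        super().__init__()
        self.hops = hops
        self.mp_q = nn.Linear(d, d)   # separate message-passing "branches"
        self.mp_k = nn.Linear(d, d)   # for queries, keys, and values
        self.mp_v = nn.Linear(d, d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm = nn.LayerNorm(d)

    def propagate(self, h, adj, proj):
        # Simple mean-neighborhood aggregation, repeated `hops` times.
        for _ in range(self.hops):
            deg = adj.sum(-1, keepdim=True).clamp(min=1)
            h = torch.relu(proj(adj @ h / deg + h))
        return h

    def forward(self, h, adj):
        # h: (batch, nodes, d) node features; adj: (batch, nodes, nodes)
        q = self.propagate(h, adj, self.mp_q)
        k = self.propagate(h, adj, self.mp_k)
        v = self.propagate(h, adj, self.mp_v)
        out, _ = self.attn(q, k, v)              # global mixing via attention
        return self.norm(out + self.ffn(out))    # MLP + LayerNorm, as in the summary

model = NodeGTransformer()
h = torch.randn(2, 9, 64)                 # 2 molecules, 9 atoms each (toy sizes)
adj = (torch.rand(2, 9, 9) > 0.7).float() # stand-in adjacency matrices
print(model(h, adj).shape)                # torch.Size([2, 9, 64])
```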
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid, and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to 155fps, and the authors claim it achieves double the mAP of other real-time detectors.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN yields a ~3% mAP gain (without adding much computation, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% vs 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
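To make the grid-based output concrete, here's a small sketch of how a single grid cell's prediction can be read out, using the paper's S=7, B=2, C=20 configuration; the decoding below is my own illustrative reading, not the authors' code.

```python
import torch

# YOLO's output for PASCAL VOC: an S x S grid where each cell predicts
# B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)         # stand-in network output for one image

def decode_cell(pred, row, col, img_size=448):
    cell = pred[row, col]
    boxes = cell[:B * 5].view(B, 5)         # (x, y, w, h, conf) per box
    class_probs = cell[B * 5:]              # shared across the cell's boxes
    best = boxes[:, 4].argmax()             # keep the most confident box
    x, y, w, h, conf = boxes[best]
    # x, y are offsets within the cell; w, h are relative to the whole image.
    cx = (col + x) / S * img_size
    cy = (row + y) / S * img_size
    bw, bh = w * img_size, h * img_size
    score, cls = (conf * class_probs).max(dim=0)   # class-specific confidence
    return (cx.item(), cy.item(), bw.item(), bh.item()), cls.item(), score.item()

print(decode_cell(pred, 3, 4))
```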