This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a Hebbian-like rule. They correspond to the matrices $A(t)$ in the figure below:

![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png)

These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they only change during normal training and are otherwise kept constant at test time.

More specifically, the proposed fast weights RNN computes a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S-1}(t+1), h_S(t+1)$, defined as follows:

$$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$

where $f$ is an element-wise non-linearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step.

As for the fast weights $A(t)$, they too change between time steps, using the Hebbian-like rule:

$$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$

where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weights' "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past (a minimal code sketch of this recurrence appears further below).

In fact, the authors show an explicit relation between these fast weights and the memory-augmented architectures that have recently been popular. Indeed, by recursively applying and expanding the equation for the fast weights, one obtains

$$A(t) = \eta \sum_{\tau = 1}^{\tau = t-1}\lambda^{t-\tau-1} h(\tau) h(\tau)^T$$

*(note the difference with Equation 3 of the paper... I think there was a typo)* which implies that the term $A(t) h_s(t+1)$ in the expression for going from $h_s(t+1)$ to $h_{s+1}(t+1)$ actually corresponds to

$$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t-1} \lambda^{t-\tau-1} h(\tau) (h(\tau)^T h_s(t+1))$$

i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden state weighted by an "attention weight" $h(\tau)^T h_s(t+1)$. The difference with many recent memory-augmented architectures is thus that the attention weights aren't computed using a softmax non-linearity.

Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time step, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved.

The authors present a series of 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, the experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures.
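To make the recurrence and the fast-weight update above concrete, here is a minimal NumPy sketch of one time step, plus the equivalent "memory" view of the $A(t) h_s(t+1)$ term. The variable names, the initialization of the inner-loop state, and the default $\lambda$, $\eta$ values are my own choices based on the equations above; this is not the authors' code, and layer normalization is omitted:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def fast_weights_step(h, x, W, C, A, S=1, lam=0.9, eta=0.5):
    """One time step of a fast weights RNN (simplified; no layer normalization).

    h    : previous hidden state h(t), shape (d,)
    x    : current input x(t), shape (n,)
    W, C : slow recurrent and input weights (learned by backprop)
    A    : fast weight matrix A(t), shape (d, d)
    """
    slow_part = W @ h + C @ x                  # kept fixed during the inner loop
    h_s = relu(slow_part)                      # h_0(t+1); one possible initialization
    for _ in range(S):                         # h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))
        h_s = relu(slow_part + A @ h_s)
    h_next = h_s                               # h(t+1) = h_S(t+1)
    A_next = lam * A + eta * np.outer(h, h)    # A(t+1) = lambda A(t) + eta h(t) h(t)^T
    return h_next, A_next

def fast_weight_term_as_attention(past_hs, h_s, lam=0.9, eta=0.5):
    """Equivalent "memory" view of A(t) h_s(t+1): a weighted sum of past hidden states."""
    t = len(past_hs) + 1                       # past_hs holds h(1), ..., h(t-1)
    out = np.zeros_like(h_s)
    for tau, h_tau in enumerate(past_hs, start=1):
        attention = h_tau @ h_s                # "attention weight" h(tau)^T h_s(t+1)
        out += eta * lam ** (t - tau - 1) * attention * h_tau
    return out
```

In the second view the fast weight matrix is never materialized, which is the connection to memory-augmented architectures discussed above.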
In the experiments, the fast weights RNN is, surprisingly, even competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy.

**My two cents**

This is a very thought-provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one were to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work). Intuitively, LSTMs seem to be able to do more than just attend to the recent past. But, for a given task, if one were to observe that fast weights RNNs are competitive with LSTMs, it would suggest that the LSTM isn't doing something that much more complex. So it would be interesting to determine on which tasks the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration.

The discussion at the end of Section 3, on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches, is interesting. However, it also suggests that computation in the fast weights RNN scales quadratically with the sequence length (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind if one were to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality).

Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to mini-batch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already mini-batch compatible code out there for dealing with a memory matrix of stored previous memory states.

This work bears some (partial) resemblance to other recent work, which may serve as food for thought here. The use of possibly multiple computation layers between time steps reminds me of the [Adaptive Computation Time (ACT) RNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backpropable architecture that involves updates to weights (here, Hebbian-like updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work](http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)). Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s.
So, for young researchers out there looking for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :-) To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it!
---
### What is BN:

* Batch Normalization (BN) is a normalization method/layer for neural networks.
* Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1], or to mean=0 and variance=1. The latter is called *Whitening*.
* BN essentially performs Whitening to the intermediate layers of the networks.

### How it's calculated:

* The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and $\text{var}(x)$ is its variance within a batch.
* BN extends that formula further to $x^{**} = \gamma x^* + \beta$, where $x^{**}$ is the final normalized value. $\gamma$ and $\beta$ are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases. (A minimal code sketch of these formulas appears at the end of these notes.)
* For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over N\*P\*Q examples (same for the variance).

### Theoretical effects:

* BN reduces *Covariate Shift*, i.e. the change in distribution of the activation of a component. By using BN, each neuron's activation becomes (more or less) a Gaussian distribution, i.e. it is usually not active, sometimes a bit active, rarely very active.
* Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for Gaussian distributions).
* BN reduces the effects of exploding and vanishing gradients, because every activation becomes roughly normally distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer, and so on.

### Practical effects:

* BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.)
* BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.)
* BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.)
* BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.)

![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations")

*BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 15th percentile and the bottom line is the 85th percentile.*

-------------------------

### Rough chapter-wise notes

* (2) Towards Reducing Covariate Shift
  * Batch Normalization (*BN*) is a special normalization method for neural networks.
  * In neural networks, the inputs to each layer depend on the outputs of all previous layers.
  * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*.
  * If the distributions stayed the same, it would simplify the training.
  * Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions).
  * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network.
  * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance).
  * That accomplishes:
    * No more covariate shift.
    * Fixes problems with vanishing gradients due to saturation.
  * Effects:
    * Networks learn faster. (As they don't have to adjust to covariate shift any more.)
    * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.)
    * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.)
    * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.)
    * BN reduces the need for dropout. (As it has a regularizing effect.)
  * How BN works:
    * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*.
    * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change.
    * A proper method has to include the current example *and* all previous examples in the normalization step.
    * This leads to calculating the covariance matrix and its inverse square root. That's expensive. The authors found a faster way.
* (3) Normalization via Mini-Batch Statistics
  * Each feature (component) is normalized individually. (Due to cost, differentiability.)
  * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))`
  * Normalizing each component can reduce the expressivity of nonlinearities. Hence the formula is changed so that it can also learn the identity function.
  * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component)
  * E and Var are estimated for each mini-batch.
  * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left).
* (3.1) Training and Inference with Batch-Normalized Networks
  * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training.
  * During test time, the BN formulas can be simplified to a single linear transformation.
* (3.2) Batch-Normalized Convolutional Networks
  * The authors recommend placing BN layers after linear/fully-connected layers and before the nonlinearities.
  * They argue that the outputs of linear layers have distributions that are more likely to be similar to a Gaussian.
  * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason).
  * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta).
  * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*p\*q, where m is the number of examples in the batch and p and q are the feature map dimensions (height, width). BN for linear layers has a batch size of m.
  * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: per neuron.)
* (3.3) Batch Normalization enables higher learning rates
  * BN normalizes activations.
  * Result: Changes to early layers don't amplify towards the end.
  * BN makes it less likely to get stuck in the saturating parts of nonlinearities.
  * BN makes training more resilient to parameter scales.
  * Usually, large learning rates cannot be used as they tend to scale up the parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions.
  * With BN, gradients actually go down as parameters increase. Therefore, higher learning rates can be used.
  * (something about singular values and the Jacobian)
* (3.4) Batch Normalization regularizes the model
  * Usually: Examples are seen on their own by the network.
  * With BN: Examples are seen in conjunction with other examples (mean, variance).
  * Result: The network can't easily memorize the examples any more.
  * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength.
* (4) Experiments
  * (4.1) Activations over time
    * They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.)
    * Batch size was 60.
    * The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training.
    * Generalization of the BN network seemed to be better.
  * (4.2) ImageNet classification
    * They applied BN to the Inception network.
    * Batch size was 32.
    * During training they used (compared to the original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation.
    * They shuffle the data during training (i.e. each batch contains different examples).
    * Depending on the learning rate, they either achieve the same accuracy (as the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate).
    * BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU).
    * An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet.
* (5) Conclusion
  * BN is similar to a normalization layer suggested by Gülcehre and Bengio. However, they applied it to the outputs of nonlinearities.
  * They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function).
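To make the normalization formulas above concrete, here is a minimal NumPy sketch of the training-time transform for a fully-connected layer, plus the test-time folding into a single linear transform. The function names are my own, and the tracking of running statistics and the convolutional reshaping are only hinted at in the comments:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time BN for a linear layer: x has shape (N, D), N examples, D features.

    For convolutions with activations of shape (N, C, P, Q), the same statistics
    would be computed per channel over all N*P*Q values (reshape accordingly).
    """
    mean = x.mean(axis=0)                     # E[x] per feature, over the mini-batch
    var = x.var(axis=0)                       # var(x) per feature, over the mini-batch
    x_norm = (x - mean) / np.sqrt(var + eps)  # x* = (x - E[x]) / sqrt(var(x))
    return gamma * x_norm + beta              # x** = gamma * x* + beta

def batchnorm_inference_params(gamma, beta, running_mean, running_var, eps=1e-5):
    """At test time BN collapses into a single linear transform y = a * x + b."""
    a = gamma / np.sqrt(running_var + eps)
    b = beta - a * running_mean
    return a, b
```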
---
## Introduction

* [Link to Paper](http://arxiv.org/pdf/1412.6071v4.pdf)
* Spatial pooling layers are building blocks for Convolutional Neural Networks (CNNs).
* The input to a pooling operation is an $N_{in} \times N_{in}$ matrix and the output is a smaller $N_{out} \times N_{out}$ matrix.
* The pooling operation divides the $N_{in} \times N_{in}$ square into $N^2_{out}$ pooling regions $P_{i, j}$.
* $P_{i, j} \subset \{1, 2, \dots, N_{in}\}^2$ for all $(i, j) \in \{1, \dots, N_{out}\}^2$.

## MP2

* Refers to the 2x2 max-pooling layer.
* Popular choice for the max-pooling operation.

### Advantages of MP2

* Fast.
* Quickly reduces the size of the hidden layer.
* Encodes a degree of invariance with respect to translations and elastic distortions.

### Issues with MP2

* Disjoint nature of pooling regions.
* Since the size decreases rapidly, stacks of back-to-back CNNs are needed to build deep networks.

## FMP

* Reduces the spatial size of the image by a factor of $\alpha$, where $\alpha \in (1, 2)$.
* Introduces randomness in terms of the choice of pooling regions.
* Pooling regions can be chosen in a *random* or *pseudorandom* manner.
* Pooling regions can be *disjoint* or *overlapping*.

## Generating Pooling Regions

* Let $a_i$ and $b_i$ be two increasing sequences of integers, starting at 1 and ending at $N_{in}$.
* Increments are either 1 or 2.
* For *disjoint* regions, $P_{i,j} = [a_{i-1}, a_i - 1] \times [b_{j-1}, b_j - 1]$.
* For *overlapping* regions, $P_{i,j} = [a_{i-1}, a_i] \times [b_{j-1}, b_j]$.
* Pooling regions can be generated *randomly* by choosing the increment randomly at each step.
* To generate pooling regions in a *pseudorandom* manner, choose $a_i = \lceil \alpha (i + u) \rceil$, where $\alpha \in (1, 2)$ and $u \in (0, 1)$. (A code sketch is given after these notes.)
* Each FMP layer uses a different pair of sequences.
* An FMP network can be thought of as an ensemble of similar networks, with each different pooling-region configuration defining a different member of the ensemble.

## Observations

* *Random* FMP is good on its own but may underfit when combined with dropout or training data augmentation.
* The *pseudorandom* approach generates more stable pooling regions.
* *Overlapping* FMP performs better than *disjoint* FMP.

## Weakness

* No justification is provided for the observations mentioned above.
* It remains to be seen how performance is affected if FMP replaces the pooling layers in architectures like GoogLeNet.
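As a rough illustration of the pseudorandom sequence formula and of the two kinds of regions described above, here is a small Python sketch. The function names are mine, and the boundary handling (forcing a sequence to start at 1 and end exactly at $N_{in}$) is deliberately omitted:

```python
import math
import random

def pseudorandom_sequence(n_out, alpha, u=None):
    """Pseudorandom FMP index sequence: a_i = ceil(alpha * (i + u)), alpha in (1, 2).

    u is drawn once per layer (and redrawn per forward pass during training).
    Boundary handling (clamping the sequence to [1, N_in]) is omitted here.
    """
    if u is None:
        u = random.random()
    return [math.ceil(alpha * (i + u)) for i in range(n_out + 1)]

def pooling_regions(a, b, overlapping=False):
    """Build pooling regions from two index sequences a and b.

    Disjoint:    P_ij = [a_{i-1}, a_i - 1] x [b_{j-1}, b_j - 1]
    Overlapping: P_ij = [a_{i-1}, a_i]     x [b_{j-1}, b_j]
    """
    end = 0 if overlapping else 1
    return [[((a[i - 1], a[i] - end), (b[j - 1], b[j] - end))
             for j in range(1, len(b))]
            for i in range(1, len(a))]
```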
---
TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in finetuning). ProgNNs train a neural network on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches.

#### Key Points

- Finetuning is a destructive process that forgets previous knowledge. We don't want that.
- Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. The parameters of those connections are learned, but networks 2 and 1 are frozen during training of network 3. (A rough sketch of this is given after these notes.)
- Downside: The number of parameters grows quadratically with the number of tasks. The paper discusses some approaches to address this problem, but it is not clear how well they work in practice.
- Metric: AUC (average score per episode during training) as opposed to final score. Transfer score = relative performance compared with a single-net baseline.
- The authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network.
- Experiment 1: Variations of the Pong game. The baseline that finetunes only the final layer fails to learn. ProgNN beats the other baselines and APS shows re-use of knowledge.
- Experiment 2: Different Atari games. ProgNNs result in positive transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning the final layers fails again. ProgNN beats the other approaches.
- Experiment 3: Labyrinth, 3D maze. Pretty much the same result as in the other experiments.

#### Notes

- It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? The networks are trained on different tasks, so the layer representations, or even the numbers of layers, may be completely different. And once you introduce lateral connections from all layers to all other layers, the approach no longer scales.
- Old tasks cannot learn from new tasks. Unlike humans.
- Gating or residuals for the lateral connections could make sense, to allow the network to "easily" re-use previously learned knowledge.
- Why use the AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain.
- Scary that finetuning only the final layer fails in most experiments. That's a very commonly used approach in non-RL domains.
- Someone should try this on non-RL tasks.
- What happens to training time and optimization difficulty as you add more columns? Seems prohibitively expensive.
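A minimal sketch of the lateral connections described in the key points, assuming (as a simplification) that the adapters are plain linear maps rather than the scaled MLP adapters used in the paper:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def prognn_layer(h_prev_current, h_prev_frozen, W, U_laterals):
    """Hidden layer k of the newest column in a progressive net (simplified).

    h_prev_current : activation h_{k-1} of the current, trainable column
    h_prev_frozen  : list of h_{k-1} activations from the frozen earlier columns
    W              : trainable weights of the current column
    U_laterals     : trainable lateral (adapter) weights, one per frozen column
    """
    pre = W @ h_prev_current
    for U, h_frozen in zip(U_laterals, h_prev_frozen):
        pre = pre + U @ h_frozen   # lateral input from a frozen column
    return relu(pre)
```

During training of the newest column, gradients flow through the lateral weights `U_laterals` but are never propagated into the frozen columns' own parameters.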
---
#### Very Brief Summary:

This paper combines stochastic variational inference with memory-augmented recurrent neural networks. The authors test 4 variants of their models against the Variational Recurrent Neural Network on 7 artificial tasks requiring long-term memory. The reported log-likelihood lower bound is not obviously improved by the new models on all tasks but is slightly better on tasks requiring high-capacity memory.

#### Slightly Less Brief Summary:

The authors propose a general class of generative models for time-series data with both deterministic and stochastic latents. The deterministic latents, $h_t$, evolve as a recurrent net with augmented memory, and the stochastic latents, $z_t$, are Gaussians whose mean and variance are a deterministic function of $h_t$. The observations at each time step, $x_t$, are also Gaussians whose mean and variance are parametrised by a function of $h_{<t}, x_{<t}$.

#### Generative Temporal Models without Augmented Memory:

The family of generative temporal models is fairly broad and includes Kalman filters, non-linear dynamical systems, hidden Markov models and switching state-space models. More recent non-linear models such as the variational RNN are most similar to the new models in this paper. In general, all of the mentioned temporal models can be written as:

$$P_\theta(x_{\leq T}, z_{\leq T}) = \prod_t P_\theta(x_t \mid f_x(z_{\leq t}, x_{<t})) \, P_\theta(z_t \mid f_z(z_{<t}, x_{<t}))$$

The differences between models then come from the exact forms of $f_x$ and $f_z$, with most models making strong conditional independence assumptions and/or having linear dependence. For example, in a Gaussian state-space model both $f_x$ and $f_z$ are linear, the latents form a first-order Markov chain and the observations $x_t$ are conditionally independent of everything else given $z_t$. In the Variational Recurrent Neural Net (VRNN) an additional deterministic latent variable $h_t$ is introduced, and at each time step $x_t$ is the output of a VAE whose prior over $z_t$ is conditioned on $h_t$. $h_t$ evolves as an RNN.

#### Types of Model with Augmented Memory:

This paper follows the same strategy as the VRNN but adds more structure to the underlying recurrent neural net. The authors motivate this by saying that the VRNN "scales poorly when higher capacity storage is required".

* "Introspective" Model: In the first augmented-memory model, the deterministic latent $M_t$ is simply a concatenation of the last $L$ stochastic latent variables $z_t$. A soft method of attention over the latent memory is used to generate a "memory context" vector at each time step (see the sketch after these notes). The observed output $x_t$ is a Gaussian with mean and variance parametrised by the "memory context" and the stochastic latent $z_t$. Because this model does not learn to write to memory, it is faster to train.
* In the later models the memory read and write operations are the same as those in the Neural Turing Machine or the Differentiable Neural Computer.

#### My Two Cents:

In some senses this paper feels fairly inevitable, since VAEs have already been married with RNNs and so it's a small leap to add augmented memory. The read/write operations introduced in the "introspective" model feel a little hacky and unprincipled. The actual images generated are quite impressive. I'd like to see how these kinds of models do on language generation tasks and whether they can be adapted for question answering.
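A rough sketch of the soft read from the latent memory in the "introspective" model. The names are my own, how the read query is produced from the deterministic state is an assumption, and the paper's exact scoring function may differ:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def introspective_read(M, query):
    """Soft attention over the last L stored stochastic latents.

    M     : (L, d) memory whose rows are z_{t-L}, ..., z_{t-1}
    query : (d,) read key, assumed to be produced from the deterministic state h_t
    """
    scores = M @ query         # similarity between each stored latent and the query
    weights = softmax(scores)  # soft attention weights over memory slots
    return weights @ M         # "memory context": weighted sum of stored latents
```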