Welcome to ShortScience.org! 
[link]
Cheney et al. study the robustness of deep neural networks, especially AlexNet, to randomly dropping or perturbing weights. The authors consider three types of perturbations: synapse knockouts set random weights to zero, node knockouts set all weights belonging to a set of neurons to zero, and weight perturbations add random Gaussian noise to the weights of a specific layer. The perturbations are applied to AlexNet one layer at a time and evaluated by top-5 accuracy on ImageNet. For example, Figure 1 (left) shows the influence on accuracy when knocking out synapses. The lower layers, especially the first convolutional layer, are impacted significantly by these perturbations. Similar observations are made for random perturbations of weights (Figure 1, right), although the impact is less significant. High-level features in particular, i.e., the layers that compute them, seem to be robust to these kinds of perturbations. The authors also provide evidence that these results extend to top-1 accuracy as well as to other architectures. For VGG, however, the impact is significantly less pronounced, which may also be due to the dropout layers it employs.

https://i.imgur.com/78T6Gg2.png

Figure 1: Left: Influence of setting weights in the corresponding layers to zero. Right: Influence of randomly perturbing weights of specific layers. Experiments are on ImageNet using AlexNet.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
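The three perturbation types described above can be illustrated with a small NumPy sketch. This is not the authors' code: the function names, the choice of axis 0 as the "neuron" axis, the weight shape and the perturbation strengths are all assumptions made for illustration.

```python
import numpy as np

def synapse_knockout(weights, fraction, rng):
    """Set a random fraction of the individual weights to zero."""
    keep_mask = rng.random(weights.shape) >= fraction
    return weights * keep_mask

def node_knockout(weights, fraction, rng):
    """Zero all weights of a random subset of units (assumed to lie along axis 0)."""
    n_units = weights.shape[0]
    n_drop = int(round(fraction * n_units))
    dropped = rng.choice(n_units, size=n_drop, replace=False)
    perturbed = weights.copy()
    perturbed[dropped] = 0.0
    return perturbed

def weight_noise(weights, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every weight."""
    return weights + rng.normal(0.0, sigma, size=weights.shape)

rng = np.random.default_rng(0)
# Hypothetical first-layer weights: 96 filters of size 3 x 11 x 11 (illustrative shape).
conv1 = rng.normal(0.0, 0.01, size=(96, 3, 11, 11))

knocked = synapse_knockout(conv1, 0.3, rng)
nodes = node_knockout(conv1, 0.25, rng)
noisy = weight_noise(conv1, 0.01, rng)
```

In an experiment like the one summarized here, one would swap the perturbed array back into a single layer of the trained network and re-evaluate accuracy, repeating this per layer and per perturbation strength.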
[link]
### General Approach

The Neural Tree Indexer (NTI) approach reaches 87.3\% test accuracy on SNLI. Here I'll attempt to clearly describe the steps involved, based on the publication [1] and the open-sourced codebase [2].

NTI is a method to apply attention over a tree, specifically applied to sentence pairs. There are three main steps, each giving an incrementally more expressive representation of the input. It's worth noting that the tree is a full binary tree, so sentence lengths are padded to a power of 2. In this case, the padded length used is $2^5 = 32$.

1. **Sequence Encoding.** Run an RNN over your sentence to get new hidden states for each element.

   $$h_t = f_1^{rnn}(i_t, h_{t-1})$$

2. **Tree Encoding.** Using the hidden states from the previous step, use a variant of TreeLSTM to combine leaves until you have a single hidden state representing the entire sentence. Keep all of the intermediary hidden states for the next step.

   $$h_t^{tree} = f^{tree}(h_l^{tree}, h_r^{tree})$$

3. **Attention on Opposite Tree.** Until now we've only been describing how to encode a single sentence. When incorporating attention, we attend on the opposite tree by using the hidden states from the previous step. For instance, here is how we'd encode the premise (where the $p,h$ superscripts denote the premise or hypothesis, and $\vec{h}^{h,tree}$ denotes all of the hidden states of the non-attended hypothesis tree):

   $$h_t^p = f_1^{rnn}(i_t^p, h_{t-1}^p) \\ h_t^{p,tree} = f^{tree}(h_l^{p,tree}, h_r^{p,tree}) \\ i_t^{p,attn} = f^{attn}(h_t^{p,tree}, \vec{h}^{h,tree}) \\ h_t^{p,attn} = f_2^{rnn}(i_t^{p,attn}, h_{t-1}^{p,attn})$$

### Datasets

NTI was evaluated on three datasets. Some variant of the model achieved state-of-the-art in some category for each dataset:

* SNLI [3]: Sentence Pair Classification.
* WikiQA [4]: Answer Sentence Selection.
* Stanford Sentiment TreeBank (SST) [5]: Sentence Classification.
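Returning to the tree encoding step of the general approach, the bottom-up combination over a full binary tree can be sketched as follows. This is only an illustration under stated assumptions: `compose` is a plain tanh layer standing in for $f^{tree}$ (it is not the TreeLSTM variant used by NTI), the hidden size of 8 is a toy value, and all function names are made up for this sketch.

```python
import numpy as np

def pad_left(states, length):
    """Left-pad a (T, H) array of hidden states with zero rows up to `length` rows."""
    t, h = states.shape
    if t > length:
        raise ValueError("sentence is longer than the padded length")
    return np.vstack([np.zeros((length - t, h)), states])

def compose(left, right, w, b):
    """Stand-in for f^tree: tanh of an affine map of the two children (an assumption)."""
    return np.tanh(np.concatenate([left, right], axis=-1) @ w + b)

def encode_tree(leaves, w, b):
    """Combine neighbouring nodes level by level; return all node states, root last."""
    levels = [leaves]
    while levels[-1].shape[0] > 1:
        current = levels[-1]
        levels.append(compose(current[0::2], current[1::2], w, b))
    return np.vstack(levels)

rng = np.random.default_rng(0)
H = 8                                         # toy hidden size
w = rng.uniform(-0.1, 0.1, size=(2 * H, H))   # toy composition weights
b = np.zeros(H)

sentence = rng.normal(size=(11, H))           # hidden states of an 11-token sentence
leaves = pad_left(sentence, 32)               # 2^5 = 32 leaves
nodes = encode_tree(leaves, w, b)             # 32 + 16 + 8 + 4 + 2 + 1 = 63 node states
root = nodes[-1]                              # representation of the whole sentence
```

Keeping every intermediate node in `nodes` mirrors the instruction to keep all intermediary hidden states, since the attention step attends over all of them rather than over the root alone.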
### Implementation Details

* Batch size is $32$ pairs (so $32$ of each premise and hypothesis).
* Tree is a full binary tree with $2^5 = 32$ leaves.
* All sentences are padded left to length $32$, matching the full binary tree.
* Step 1 (sequence encoding) runs on all sentences simultaneously, and so does Step 2 (tree encoding). Step 3 (attention) is done first on the premise, then on the hypothesis.
* The variant of TreeLSTM used is S-LSTM. It's available as a standard function in Chainer.
* Dropout is applied liberally in each step. The keep rate is fixed at $80\%$.
* MLP has $1$ hidden layer with dimension $1024$. Dimensions of the entire MLP are: $(2 \times H) \times 1024 \times 3$. $H$ is the size of the hidden states and is $300$.
* Uses Chainer's Adam optimizer with $\alpha=0.0003,\beta_1=0.9,\beta_2=0.999,\epsilon=10^{-8}$. Gradient clipping using L2 norm of $40$. Parameters periodically scaled by $0.00003$ (weight decay).
* Weights are initialized uniformly at random between $-0.1$ and $0.1$.

[1]: https://arxiv.org/abs/1607.04492
[2]: https://bitbucket.org/tsendeemts/nti/overview
[3]: nlp.stanford.edu/projects/snli/
[4]: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/
[5]: http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrixVectorSpaces
[link]
#### Very Brief Summary:

This paper combines stochastic variational inference with memory-augmented recurrent neural networks. The authors test 4 variants of their models against the Variational Recurrent Neural Network on 7 artificial tasks requiring long-term memory. The reported log-likelihood lower bound is not obviously improved by the new models on all tasks but is slightly better on tasks requiring high-capacity memory.

#### Slightly Less Brief Summary:

The authors propose a general class of generative models for time-series data with both deterministic and stochastic latents. The deterministic latents, $h_t$, evolve as a recurrent net with augmented memory, and the stochastic latents, $z_t$, are Gaussians whose mean and variance are a deterministic function of $h_t$. The observations at each timestep, $x_t$, are also Gaussians whose mean and variance are parametrised by a function of $h_{<t}, x_{<t}$.

#### Generative Temporal Models without Augmented Memory:

The family of generative temporal models is fairly broad and includes Kalman filters, non-linear dynamical systems, hidden Markov models and switching state-space models. More recent non-linear models such as the variational RNN are most similar to the new models in this paper. In general, all of the mentioned temporal models can be written as:

$$P_\theta(x_{\leq T}, z_{\leq T}) = \prod_t P_\theta(x_t \mid f_x(z_{\leq t}, x_{<t})) \, P_\theta(z_t \mid f_z(z_{<t}, x_{<t}))$$

The differences between models then come from the exact forms of $f_x$ and $f_z$, with most models making strong conditional independence assumptions and/or having linear dependence. For example, in a Gaussian state-space model both $f_x$ and $f_z$ are linear, the latents form a first-order Markov chain and the observations $x_t$ are conditionally independent of everything given $z_t$.
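As a concrete instance of the Gaussian state-space special case just described, here is a minimal sampling sketch. The matrices `a` and `c`, the noise scales and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_lgssm(a, c, q_std, r_std, z0, steps, rng):
    """Sample latents and observations from a linear Gaussian state-space model."""
    z = np.asarray(z0, dtype=float)
    latents, observations = [], []
    for _ in range(steps):
        # f_z is linear and depends only on z_{t-1} (first-order Markov chain).
        z = a @ z + rng.normal(0.0, q_std, size=z.shape)
        # f_x is linear and depends only on z_t (x_t is independent of the rest given z_t).
        x = c @ z + rng.normal(0.0, r_std, size=c.shape[0])
        latents.append(z)
        observations.append(x)
    return np.stack(latents), np.stack(observations)

rng = np.random.default_rng(0)
a = np.array([[0.9, 0.1],
              [0.0, 0.9]])      # hypothetical transition matrix
c = np.array([[1.0, 0.0]])      # hypothetical observation matrix
latents, observations = sample_lgssm(a, c, 0.1, 0.05, np.zeros(2), 50, rng)
```

The richer models discussed in this summary replace these two linear maps with neural networks and let them depend on a longer history.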
In the Variational Recurrent Neural Net (VRNN) an additional deterministic latent variable $h_t$ is introduced, and at each timestep $x_t$ is the output of a VAE whose prior over $z_t$ is conditioned on $h_t$. $h_t$ evolves as an RNN.

#### Types of Model with Augmented Memory:

This paper follows the same strategy as the VRNN but adds more structure to the underlying recurrent neural net. The authors motivate this by saying that the VRNN "scales poorly when higher capacity storage is required".

* "Introspective" Model: In the first augmented memory model, the deterministic latent $M_t$ is simply a concatenation of the last $L$ latent stochastic variables $z_t$. A soft method of attention over the latent memory is used to generate a "memory context" vector at each time step. The observed output $x_t$ is a Gaussian with mean and variance parameterised by the "memory context" and the stochastic latent $z_t$. Because this model does not learn to write to memory, it is faster to train.
* In the later models the memory read and write operations are the same as those in the Neural Turing Machine or Differentiable Neural Computer.

#### My Two Cents:

In some senses this paper feels fairly inevitable, since VAEs have already been married with RNNs and so it's a small leap to add augmented memory. The actual read/write operations introduced in the "introspective" model feel a little hacky and unprincipled. The actual images generated are quite impressive. I'd like to see how these kinds of models do on language generation tasks and whether they can be adapted for question answering.
[link]
* The paper describes a method to separate content and style from each other in an image.
* The style can then be transferred to a new image.
* Examples:
  * Let a photograph look like a painting of van Gogh.
  * Improve a dark beach photo by taking the style from a sunny beach photo.

### How

* They use the pretrained 19-layer VGG net as their base network.
* They assume that two images are provided: One with the *content*, one with the desired *style*.
* They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*.
* They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*.
* How to calculate a *Gram Matrix*:
  * Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations.
  * Convert each filter's activations to a (1-dimensional) vector.
  * Pick all pairs of filters. Calculate the scalar product of both filters' vectors.
  * Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128).
  * Repeat that for every pair to get the Gram Matrix.
* The Gram Matrix roughly represents the *texture* of the image.
* Now you have the content representation (activations of a layer) and the style representation (Gram Matrices).
* Create a new image of the size of the content image. Fill it with random white noise.
* Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.)
* Make changes to the new image using gradient descent to optimize a loss function.
* The loss function has two components:
  * The mean squared error between the new image's content representation and the previously extracted content representation.
  * The mean squared error between the new image's style representation and the previously extracted style representation.
* Add up both components to get the total loss.
* Give both components a weight to alter for more/less style matching (at the expense of content matching).

![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples")

*One example input image with different styles added to it.*

### Rough chapter-wise notes

* Page 1
  * A painted image can be decomposed into its content and its artistic style.
  * Here they use a neural network to separate content and style from each other (and to apply that style to an existing image).
* Page 2
  * Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style).
  * They call the feature responses in higher layers *content representation*.
  * To capture style information, they use a method that was originally designed to capture texture information.
  * They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multi-scale representation of the style.
* Page 3
  * They use VGG as their base CNN.
* Page 4
  * Based on the extracted style features, they can generate a new image, which has equal activations in these style features.
  * The new image should match the style (texture, color, localized structures) of the artistic image.
  * The style features become more and more abstract with higher layers. They call that multi-scale representation the *style representation*.
  * The key contribution of the paper is a method to separate style and content representation from each other.
  * These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork).
* Page 6
  * The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.)
  * Content and style can't be separated perfectly.
  * Their loss function has two terms, one for content matching and one for style matching.
  * The terms can be increased/decreased to match content or style more.
* Page 8
  * Previous techniques work only on limited or simple domains or used non-parametric approaches (see non-photorealistic rendering).
  * Previously neural networks have been used to classify the time period of paintings (based on their style).
  * They argue that separating content from style might be useful in many other domains (other than transferring style of paintings to images).
* Page 9
  * The style representation is gathered by measuring correlations between activations of neurons.
  * They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do.
  * They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for style-invariant classification.
* Page 9, Methods
  * They use the 19-layer VGG net as their basis.
  * They use only its convolutional layers, not the linear ones.
  * They use average pooling instead of max pooling, as that produced slightly better results.
* Page 10, Methods
  * The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features.
  * They build a style representation by calculating Gram Matrices for each layer.
* Page 11, Methods
  * The Gram Matrix is generated in the following way:
    * Convert each filter of a convolutional layer to a 1-dimensional vector.
    * For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters.
    * Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix.
  * Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices).
  * That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting.
* Page 12, Methods
  * To transfer the style of a painting to an existing image, proceed as follows:
    * Start with a white noise image.
    * Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting).
    * Each distance (content, style) can be weighted to have more or less influence on the loss function.
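The Gram Matrix computation and the weighted two-term loss from these notes can be sketched in NumPy as follows. This follows the description in the notes rather than the paper's exact normalisation constants, uses random arrays as stand-ins for VGG activations, and leaves out the network and the gradient descent loop; all names and weights are assumptions for illustration.

```python
import numpy as np

def gram_matrix(activations):
    """Map (filters, height, width) activations to a (filters, filters) Gram Matrix."""
    # Flatten each filter's activations to a 1-dimensional vector.
    flat = activations.reshape(activations.shape[0], -1)
    # Entry (i, j) is the scalar product of the vectors of filters i and j.
    return flat @ flat.T

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def total_loss(new_content, target_content, new_style_layers, target_style_layers,
               content_weight=1.0, style_weight=1.0):
    """Weighted sum of the content loss and the style loss over several layers."""
    content_loss = mse(new_content, target_content)
    style_loss = sum(mse(gram_matrix(new), gram_matrix(target))
                     for new, target in zip(new_style_layers, target_style_layers))
    return content_weight * content_loss + style_weight * style_loss

rng = np.random.default_rng(0)
# Stand-ins for activations; a real run would take these from the VGG layers.
content_target = rng.normal(size=(8, 4, 4))
style_targets = [rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 4, 4))]
candidate_content = rng.normal(size=(8, 4, 4))
candidate_style = [rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 4, 4))]

loss = total_loss(candidate_content, content_target, candidate_style, style_targets,
                  content_weight=1.0, style_weight=0.01)
```

Raising `style_weight` relative to `content_weight` corresponds to the trade-off mentioned in the notes: more style matching at the expense of content matching.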
[link]
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot and Bengio is a **normalized weight initialization**

$$W \sim U \left [ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and the networks were trained for many mini-batch epochs. Batch normalization or dropout might change the influence of initialization considerably.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
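A minimal NumPy sketch of the normalized initialization, assuming a weight matrix of shape `(n_in, n_out)` where `n_in` and `n_out` play the roles of $n_j$ and $n_{j+1}$; the function name and layer sizes are illustrative only.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix from U[-limit, limit] with the normalized limit."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
w = glorot_uniform(1024, 512, rng)

# For these sizes the limit is sqrt(6 / 1536) = 0.0625, and a uniform distribution
# on [-limit, limit] has variance limit^2 / 3 = 2 / (n_in + n_out).
limit = np.sqrt(6.0 / (1024 + 512))
expected_variance = 2.0 / (1024 + 512)
```

The variance $2 / (n_j + n_{j+1})$ is a compromise between keeping the scale of activations in the forward pass and the scale of gradients in the backward pass roughly constant across layers.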
