Welcome to ShortScience.org! 
[link]
# Very Short The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**. # Short Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$, (the *optimizee*) which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent  where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here). The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t  \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc) replace the multiplication of the gradient by $\alpha_t$ with some sort of weighted sum of the history of gradients. This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: $$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$ Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. A set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer networks's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing: $$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$ Where $\theta^*(f, \phi)$ are the final optimizee parameters. In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated). https://i.imgur.com/HMaCeip.png **The computational graph of the optimization of the optimizer, unrolled across 3 timesteps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.** The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG), on a synthetic dataset, MNIST, CIFAR10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pretrained optimizer can be applied to accelerate training of a newly initialized network. 
[link]
**Problem Setting:** Sequence to Sequence learning (seq2seq) is one of the most successful techniques in machine learning nowadays. The basic idea is to encode a sequence into a vector (or a sequence of vectors if using attention based encoder) and then use a recurrent decoder to decode the target sequence conditioned on the encoder output. While researchers have explored various architectural changes to this basic encoderdecoder model, the standard way of training such seq2seq models is to maximize the likelihood of each successive target word conditioned on the input sequence and the *gold* history of target words. This is also known as *teacherforcing* in RNN literature. Such an approach has three major issues: 1. **Exposure Bias:** Since we teacherforce the model with *gold* history during training, the model is never exposed to its errors during training. At test time, we will not have access to *gold* history and we feed the history generated by the model. If it is erroneous, the model does not have any clue about how to rectify it. 2. **LossEvaluation Mismatch:** While we evaluate the model using sequence level metrics (such as BLEU for Machine Translation), we are training the model with word level cross entropy loss. 3. **Label bias:** Since the word probabilities are normalized at each time step (by using softmax over the final layer of the decoder), this can result in label bias if we vary the number of possible candidates in each step. More about this later. **Solution:** This paper proposes an alternative training procedure for seq2seq models which attempt to solve all the 3 major issues listed above. The idea is to pose seq2seq learning as beamsearch optimization problem. Authors begin by removing the final softmax activation function from the decoder. Now instead of probability distributions, we will get score for next possible word. Then the training procedure is changed as follows: At every time step $t$, they maintain a set $S_t$ of $K$ candidate sequences of length $t$. Now the loss function is defined with the following characteristics: 1. If the *gold* subsequence of length $t$ is in set $S_t$ and the score for *gold* subsequence exceeds the score of the $K$th ranked candidate by a margin, the model incurs no loss. Now the candidates for next timestep are chosen in a way similar to regular beamsearch with beamsize $K$. 2. If the *gold* subsequence of length $t$ is in set $S_t$ and it is the $K$th ranked candidate, then the loss will push the *gold* sequence up by increasing its score. The candidates for next timestep are chosen in a way similar as first case. 3. If the *gold* subsequence of length $t$ is NOT in set $S_t$, then the score of the *gold* sequence is increased to be higher than $K$th ranked candidate by a margin. In this case, candidates for next step or chosen by only considering *gold* word at time $t$ and getting its top$K$ successors. 4. Further, since we want the full *gold* sequence to be at top of the beam at the end of the search, when $t=T$, the loss is modified to require the score of *gold* sequence to exceed the score of the *highest* ranked incorrect prediction by a margin. This nonprobabilistic training method has several advantages: * The model is trained in a similar way it would be tested, since we use beamsearch during training as well as testing. Hence this helps to eliminate exposure bias. * The score based loss can be easily scaled by a mistakespecific cost function. For example, in MT, one could use a cost function which is inversely proportional to BLEU score. So there is no lossevaluation mismatch. * Each time step can have different set of successor words based on any hard constraints in the problem. Note that the model is nonprobabilistic and hence this varying successor function will not introduce any label bias. Refer [this set of slides][1] for an excellent illustration of label bias. Cost of forwardprop grows linearly with respect to beam size $K$. However, GPU implementation should help to reduce this cost. Authors propose a clever way of doing BPTT which makes the backprop almost same cost as ordinary seq2seq training. **Additional Tricks** 1. Authors pretrain the seq2seq model with regular word level crossentropy loss and this is crucial since random initialization did not work. 2. Authors use "curriculum beam" strategy in training where they start with beam size of 2 and increase the beam size by 1 for every 2 epochs until it reaches the required beam size. You have to train your model with training beam size of at least test beam size + 1. (i.e $K_{tr} >= K_{te} + 1$). 3. When you use dropout, you need to be careful to use the same dropout value during backprop. Authors do this by sharing a single dropout across all sequences in a time step. **Experiments** Authors compare the proposed model against basic seq2seq in word ordering, dependency parsing and MT tasks. The proposed model achieves significant improvement over the strong baseline. **Related Work:** The whole idea of the paper is based on [learning as search optimization (LaSO) framework][2] of Daume III and Marcu (2005). Other notable related work are training seq2seq models using mix of crossentropy and REINFORCE called [MIXER][3] and [an actorcritic based seq2seq training][4]. Authors compare with MIXER and they do significantly better than MIXER. **My two cents:** This is one of the important research directions in my opinion. While other recent methods attempt to use reinforcement learning to avoid the issues in wordlevel crossentropy training, this paper proposes a really simple score based solution which works very well. While most of the language generation research is stuck with probabilistic framework (I am saying this w.r.t Deep NLP research), this paper highlights the benefits on nonprobabilistic generation models. I see this as one potential way of avoiding the nasty scalability issues that come with softmax based generative models. [1]: http://www.cs.stanford.edu/~nmramesh/crf [2]: https://www.isi.edu/~marcu/papers/daume05laso.pdf [3]: http://arxiv.org/pdf/1511.06732v7.pdf [4]: https://arxiv.org/pdf/1607.07086v2.pdf 
[link]
**Idea:** With the growing use of visual explanation systems of machine learning models such as saliency maps, there needs to be a standardized method of verifying if a saliency method is correctly describing the underlying ML model. **Solution:** In this paper two Sanity Checks have been proposed to verify the accuracy and the faithfulness of a saliency method: * *Model parameter randomization test:* In this sanity check the outputs of a saliency method on a trained model is compared to that of the same method on an untrained randomly parameterized model. If these images are similar/identical then this saliency method does not correctly describe the model. In the course of this experiment it is found that certain methods such as the Guided BackProp are constant in their explanations despite alterations in the model. * *Data Randomization Test:* This method explores the relationship of saliency methods to data and their associated labels. In this test, the labels of the training data are randomized thus there should be no definite pattern describing the model (Since the model is as good as randomly guessing an output label). If there is a definite pattern, this shows that the saliency methods are independent of the underlying model/training data labels. In this test as well Guided BackProp did not fare well, implying this saliency method is as good as an edge detector as opposed to a ML explainer. Thus this paper makes a valid argument toward having standardized tests that an interpretation model must satisfy to be deemed accurate or faithful. 
[link]
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization** $$W \sim U \left [  \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$ where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$. Showing some ways **how to debug neural networks** might be another reason to read the paper. The paper analyzed standard multilayer perceptrons (MLPs) on a artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign). However, no regularization was used and many minibatch epochs were learned. It might be that batch normalization / dropout might change the influence of initialization very much. Questions that remain open for me: * [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9) * Figure 4: Why is this plot not simply completely dependent on the data? * Is softsign still used? Why not? * If the only advantage of softsign is that is has the plateau later, why doesn't anybody use $\frac{1}{1+e^{0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
1 Comments

[link]
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$: $p_X(x) = p_Z(z) \left{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right^{1}$ Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the Jacobian of the function $g$, we can also write: $p_X(x) = p_Z(f(x)) \left{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right$ where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$. So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion), that have the properties of being easily invertible and having an easytocompute determinant of Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allows to construct highly nonlinear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first samples $z\sim p_Z$ and then computing $x=g(z)$. The building blocks for constructing $f$ are the following: **Coupling layers**: This is perhaps the most important piece. It simply computes as its output $b\odot x + (1b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are nonlinear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the nontransformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that. **Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true, even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layer, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then after at least 2 layers, all dimensions have been transformed at least once. **Squeezing operations**: Squeezing operations corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of features maps with spatial resolutions reduced by half (see Figure 3). This allows to expose multiple scales of resolutions to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration. Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channelwise masks. Since the output of each layerschunk must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each layerschunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b. Experiments on CIFAR10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the realvalued nonvolume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN. **My Two Cents** The proposed approach is quite unique and thought provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / onepass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful! I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as realvalued observations. First, it implies that its estimate of bits/dimensions is an upper bound on what it could be if the uniform subpixel noise was integrated out (see Equations 345 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a nonlinear transformation (${\rm logit}(\alpha + (1\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage. I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient descent on the loglikelihood with respect to the value of remaining dimensions. This could be interesting for image completion, or for structured output prediction with realvalued outputs in general. I also wonder how expensive that would be. In all cases, I'm looking forward to saying interesting applications and variations of this model in the future! 