![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
This paper deals with the formal question of machine reading. It proposes a novel methodology for automatic dataset building for machine reading model evaluation. To do so, the authors leverage on news resources that are equipped with a summary to generate a large number of questions about articles by replacing the named entities of it. Furthermore a attention enhanced LSTM inspired reading model is proposed and evaluated. The paper is well-written and clear, the originality seems to lie on two aspects. First, an original methodology of question answering dataset creation, where context-query-answer triples are automatically extracted from news feeds. Such proposition can be considered as important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. the empirical results seem to show relevant improvement with respect to an up-to-date list of machine reading models. Given the lack of an appropriate dataset, the author provides a new dataset which scraped CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next the author presents a two novel Deep long-short term memory models which perform well on the Cloze query task. ![]() |
[link]
### What is BN: * Batch Normalization (BN) is a normalization method/layer for neural networks. * Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*. * BN essentially performs Whitening to the intermediate layers of the networks. ### How its calculated: * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and `var(x)` is its variance within a batch. * BN extends that formula further to $x^{**} = gamma * x^* +$ beta, where $x^{**}$ is the final normalized value. `gamma` and `beta` are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases. * For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance). ### Theoretical effects: * BN reduces *Covariate Shift*. That is the change in distribution of activation of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. its usually not active, sometimes a bit active, rare very active. * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions). * BN reduces effects of exploding and vanishing gradients, because every becomes roughly normal distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on. ### Practical effects: * BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.) * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.) * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.) * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.) ![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations") *BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 15th percentile and the bottom line is the 85th percentile.* ------------------------- ### Rough chapter-wise notes * (2) Towards Reducing Covariate Shift * Batch Normalization (*BN*) is a special normalization method for neural networks. * In neural networks, the inputs to each layer depend on the outputs of all previous layers. * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*. * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions). * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network. * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance). * That accomplishes: * No more covariate shift. * Fixes problems with vanishing gradients due to saturation. * Effects: * Networks learn faster. (As they don't have to adjust to covariate shift any more.) * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.) * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.) * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.) * BN reduces the need for dropout. (As it has a regularizing effect.) * How BN works: * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*. * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change. * A proper method has to include the current example *and* all previous examples in the normalization step. * This leads to calculating in covariance matrix and its inverse square root. That's expensive. The authors found a faster way. * (3) Normalization via Mini-Batch Statistics * Each feature (component) is normalized individually. (Due to cost, differentiability.) * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))` * Normalizing each component can reduce the expressitivity of nonlinearities. Hence the formula is changed so that it can also learn the identiy function. * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component) * E and Var are estimated for each mini batch. * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left). * (3.1) Training and Inference with Batch-Normalized Networks * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training. * During test time, the BN formulas can be simplified to a single linear transformation. * (3.2) Batch-Normalized Convolutional Networks * Authors recommend to place BN layers after linear/fully-connected layers and before the ninlinearities. * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian. * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason). * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta). * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*pq, where m is the number of examples in the batch and p q are the feature map dimensions (height, width). BN for linear layers has a batch size of m. * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.) * (3.3) Batch Normalization enables higher learning rates * BN normalizes activations. * Result: Changes to early layers don't amplify towards the end. * BN makes it less likely to get stuck in the saturating parts of nonlinearities. * BN makes training more resilient to parameter scales. * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions. * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used. * (something about singular values and the Jacobian) * (3.4) Batch Normalization regularizes the model * Usually: Examples are seen on their own by the network. * With BN: Examples are seen in conjunction with other examples (mean, variance). * Result: Network can't easily memorize the examples any more. * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength. * (4) Experiments * (4.1) Activations over time ** They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.) ** Batch Size was 60. ** The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training. ** Generalization of the BN network seemed to be better. * (4.2) ImageNet classification ** They applied BN to the Inception network. ** Batch Size was 32. ** During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation. ** They shuffle the data during training (i.e. each batch contains different examples). ** Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate). ** BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU). ** An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet. * (5) Conclusion ** BN is similar to a normalization layer suggested by Gülcehre and Bengio. However, they applied it to the outputs of nonlinearities. ** They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function). ![]() |
[link]
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{-1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$: $p_X(x) = p_Z(z) \left|{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right|^{-1}$ Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the Jacobian of the function $g$, we can also write: $p_X(x) = p_Z(f(x)) \left|{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right|$ where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$. So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion), that have the properties of being easily invertible and having an easy-to-compute determinant of Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allows to construct highly non-linear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first samples $z\sim p_Z$ and then computing $x=g(z)$. The building blocks for constructing $f$ are the following: **Coupling layers**: This is perhaps the most important piece. It simply computes as its output $b\odot x + (1-b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are non-linear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the non-transformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1-b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that. **Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true, even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layer, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then after at least 2 layers, all dimensions have been transformed at least once. **Squeezing operations**: Squeezing operations corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of features maps with spatial resolutions reduced by half (see Figure 3). This allows to expose multiple scales of resolutions to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration. Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channel-wise masks. Since the output of each layers-chunk must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each layers-chunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b. Experiments on CIFAR-10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the real-valued non-volume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN. **My Two Cents** The proposed approach is quite unique and thought provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / one-pass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful! I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as real-valued observations. First, it implies that its estimate of bits/dimensions is an upper bound on what it could be if the uniform sub-pixel noise was integrated out (see Equations 3-4-5 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a non-linear transformation (${\rm logit}(\alpha + (1-\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage. I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient descent on the log-likelihood with respect to the value of remaining dimensions. This could be interesting for image completion, or for structured output prediction with real-valued outputs in general. I also wonder how expensive that would be. In all cases, I'm looking forward to saying interesting applications and variations of this model in the future! ![]() |
[link]
Rozsa et al. describe an adersarial attack against OpenMax [1] by directly targeting the logits. Specifically, they assume a network using OpenMax instead of a SoftMax layer to compute the final class probabilities. OpenMax allows “open-set” networks by also allowing to reject input samples. By directly targeting the logits of the trained network, i.e. iteratively pushing the logits in a target direction, it does not matter whether SoftMax or OpenMax layers are used on top, the network can be fooled in both cases. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
SGD is a widely used optimization method for training the parameters of some model f on some given task. Since the convergence of SGD is related to the variance of the stochastic gradient estimate, there's been a lot of work on trying to come up with such stochastic estimates with smaller variance. This paper does it using an importance sampling (IS) Monte Carlo estimate of the gradient, and learning the proposal distribution $q$ of the IS estimate. The proposal distribution $q$ is parametrized in some way, and is trained to minimize the variance of the gradient estimate. It is trained simultaneously while the model $f$ that SGD (i.e. the SGD that uses IS to get its gradient) is training. To make this whole story more recursive, the proposal distribution $q$ is also trained with SGD :-) This makes sense, since one expects the best proposal to depend on the value of the parameters of model $f$, so the best proposal $q$ should vary as $f$ is trained. One application of this idea is in optimizing a classification model over a distribution that is imbalanced class-wise (e.g. there are classes with much fewer examples). In this case, the proposal distribution determines how frequently we sample examples from each class (conditioned on the class, training examples are chosen uniformly). #### My two cents This is a really cool idea. I particularly like the application to training on an imbalanced classification problem. People have mostly been using heuristics to tackle this problem, such as initially sampling each class equally as often, and then fine-tuning/calibrating the model using the real class proportions. This approach instead proposes a really elegant, coherent, solution to this problem. I would have liked to see a comparison with that aforementioned heuristic (for mainly selfish reasons :-) ). They instead compare with an importance sampling approach with proposal that assigns the same probability to each class, which is a reasonable alternative (though I don't know if it's used as often as the more heuristic approach). There are other applications, to matrix factorization and reinforcement learning, that are presented in the paper and seem neat, though I haven't gone through those as much. Overall, one of my favorite paper this year: it's original, tackles a problem for which I've always hated the heuristic solution I'm using now, proposes an elegant solution to it, and is applicable even more widely than that setting. ![]() |