#### Introduction

* The paper proposes a general, end-to-end approach to sequence learning that uses two deep LSTMs: one to map the input sequence to a vector space and another to map that vector to the output sequence.
* For sequence learning, Deep Neural Networks (DNNs) require the dimensionality of the input and output sequences to be known and fixed. This limitation is overcome by using the two LSTMs.
* [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

#### Model

* Recurrent Neural Networks (RNNs) generalize feed-forward neural networks to sequences.
* Given a sequence of inputs $(x_1, x_2, \dots, x_T)$, an RNN computes a sequence of outputs $(y_1, y_2, \dots, y_{T'})$ by iterating over the following equations: $$h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh} h_{t-1})$$ $$y_t = W^{yh}h_t$$
* To map between variable-length sequences, the input is mapped to a fixed-size vector using an RNN, and this fixed-size vector is mapped to the output sequence using another RNN (a minimal code sketch of this setup follows at the end of this summary).
* Given the long-term dependencies between the two sequences, LSTMs are preferred over plain RNNs.
* The LSTM estimates the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation and then computing the probability of the output with a standard LSTM-LM formulation.

##### Differences between the model and standard LSTMs

* The model uses two LSTMs (one for the input sequence and another for the output sequence), thereby increasing the number of model parameters at negligible computational cost.
* The model uses deep LSTMs (4 layers).
* The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words of the source (input) sentence are much closer to the first few words of the target (output) sentence, making it easier for the LSTM to "establish communication" between input and output sentences.

#### Experiments

* WMT'14 English-to-French dataset containing 12 million sentences, consisting of 348 million French words and 304 million English words.
* The model is tested on the translation task and on the task of rescoring the n-best lists of a baseline system.
* The deep LSTMs are trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$.
* The training objective is to maximize this log probability, averaged over all pairs in the training set.
* The most likely translation is found by performing a simple left-to-right beam search.
* A hard constraint is enforced on the norm of the gradient to avoid the exploding-gradient problem.
* Mini-batches are selected to contain sentences of similar lengths to reduce training time.
* The model performs better when reversed sentences are used for training.
* While the model does not beat the state of the art, it is the first pure neural translation system to outperform a phrase-based SMT baseline.
* The model also performs well on long sentences, with only a minor degradation on the longest sentences.
* The paper prepares the ground for applying sequence-to-sequence learning models to other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on translation tasks.
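To make the encoder-decoder setup above concrete, here is a minimal PyTorch sketch. It is an illustration under simplifying assumptions rather than the paper's implementation: a single-layer LSTM stands in for the 4-layer deep LSTMs, the vocabularies are tiny toy ones, and beam search is omitted; all names (`Seq2Seq`, `src`, `tgt`) are invented for the example.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: one LSTM maps the (reversed) source sequence
    to a fixed-size state, a second LSTM generates the target from that state."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Reverse the source sequence to shorten the distance between the
        # first source words and the first target words.
        src = torch.flip(src, dims=[1])
        _, state = self.encoder(self.src_emb(src))       # (h, c): the fixed-size summary vector
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                         # logits over the target vocabulary

# Training maximizes the log probability of the correct translation, i.e. minimizes
# cross-entropy over target tokens, with a hard constraint on the gradient norm.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (8, 12))     # toy batch of source sentences
tgt = torch.randint(0, 1000, (8, 10))     # toy batch of target sentences
logits = model(src, tgt[:, :-1])          # teacher forcing: shift target by one step
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```

Passing the encoder's final `(h, c)` state as the decoder's initial state is the fixed-size vector bottleneck the summary describes.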
Rakelly et al. propose a method for off-policy meta reinforcement learning (meta-RL). The method achieves a 20-100x improvement in sample efficiency compared to on-policy meta-RL methods such as MAML+TRPO. The key difficulty for off-policy meta-RL arises from the meta-learning assumption that meta-training and meta-test time match: at test time the policy has to explore and therefore sees on-policy data, in contrast to the off-policy data that should be used during meta-training. The key contribution of PEARL is an algorithm that allows online task inference through a latent variable at both train and test time; this latent variable is used to train a Soft Actor-Critic (a very sample-efficient off-policy algorithm) with an additional dependence on the latent.

The implementation of Rakelly et al. captures knowledge about the current task in a stochastic latent variable $Z$. An inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given the context $c$ of the current task, in the form of transition tuples $(s,a,r,s')$, and is trained with an information bottleneck. Note that the task inference is done on samples drawn according to a sampling strategy that favors more recent transitions (a simplified code sketch of this inference step follows at the end of this summary). The latent $z$ is used as an additional input to the policy $\pi(a \vert s, z)$ and the Q-function $Q(s,a,z)$ of a Soft Actor-Critic algorithm, which is trained with off-policy data from the full replay buffer.

![](https://i.imgur.com/wzlmlxU.png)

So the challenge of differing conditions at test and train time is resolved by sampling the context for the latent variable at train time only from very recent transitions (which are almost on-policy), while at test time the context is on-policy by construction. Sampling $z \sim q(z \vert c)$ at test time amounts to posterior sampling of the latent variable, which yields efficient exploration.

The experiments are performed across 6 MuJoCo tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show:

- PEARL is 20-100x more sample-efficient.
- The posterior sampling of the latent context variable enables deep exploration, which is crucial in sparse-reward settings.
- The inference network could also be an RNN; however, it is crucial to train it on uncorrelated transitions instead of trajectories, whose transitions are highly correlated.
- Using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse-reward navigation tasks, which is attributed to the lack of temporally extended exploration.

The paper introduces an algorithm that combines meta-learning with an off-policy algorithm, dramatically increasing sample efficiency compared to on-policy meta-learning approaches. This increases the chance of seeing meta-RL in real-world applications.
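The task-inference step can be sketched roughly as follows. This is a simplification, not the authors' code: the actual PEARL encoder combines per-transition Gaussian factors as a product and is trained through the critic loss together with the KL term, and the names (`ContextEncoder`) and toy dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a set of context transitions (s, a, r, s') to a Gaussian posterior over the
    latent task variable z. Per-transition encodings are simply averaged here; the actual
    PEARL encoder combines Gaussian factors multiplicatively."""
    def __init__(self, transition_dim, latent_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance of q(z | c)
        )

    def forward(self, context):                  # context: (num_transitions, transition_dim)
        mu, logvar = self.net(context).mean(dim=0).chunk(2)
        return torch.distributions.Normal(mu, torch.exp(0.5 * logvar))

obs_dim, act_dim, latent_dim = 8, 2, 5
encoder = ContextEncoder(transition_dim=obs_dim + act_dim + 1 + obs_dim, latent_dim=latent_dim)
policy = nn.Sequential(nn.Linear(obs_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# Posterior sampling: draw z ~ q(z | c) and condition the policy on it.
# At meta-test time the context c is gathered on-policy as the agent explores.
context = torch.randn(32, obs_dim + act_dim + 1 + obs_dim)   # recent (s, a, r, s') tuples
q_z = encoder(context)
z = q_z.rsample()                                            # reparameterized task latent
obs = torch.randn(obs_dim)
action = policy(torch.cat([obs, z]))                         # pi(a | s, z)

# The information bottleneck adds a KL(q(z|c) || N(0, I)) penalty to the training loss.
prior = torch.distributions.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
kl_penalty = torch.distributions.kl_divergence(q_z, prior).sum()
```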
Wu et al. provide a framework, behavior regularized actor critic (BRAC), which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.

Pure off-policy RL describes the problem of learning a policy purely from a batch $B$ of one-step transitions collected with a behavior policy $\pi_b$; the setting allows no further interaction with the environment. This learning regime is desirable in high-stakes scenarios such as education or health care. The core principle of batch RL algorithms is to stay, in some sense, close to the behavior policy.

The paper proposes to incorporate this firstly via a regularization term in the value function, denoted as **value penalty** (a simplified code sketch of this variant follows at the end of this summary). In this case the value function of BRAC takes the following form:

$$V_D^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s_t \sim P_t^{\pi}(s)}\left[R^{\pi}(s_t) - \alpha D\big(\pi(\cdot \vert s_t) \Vert \pi_b(\cdot \vert s_t)\big)\right],$$

where $\pi_b$ is the maximum likelihood estimate of the behavior policy based upon $B$. This results in the Q-function objective

$$\min_{Q} \; \mathbb{E}_{\substack{(s,a,r,s') \sim B \\ a' \sim \pi_{\theta}(\cdot \vert s')}}\left[\Big(r + \gamma \big(\bar{Q}(s',a') - \alpha D\big(\pi_{\theta}(\cdot \vert s') \Vert \pi_b(\cdot \vert s')\big)\big) - Q(s,a)\Big)^2\right]$$

and the corresponding policy update

$$\max_{\pi_{\theta}} \; \mathbb{E}_{(s,a,r,s') \sim B}\left[\mathbb{E}_{a'' \sim \pi_{\theta}(\cdot \vert s)}\big[Q(s,a'')\big] - \alpha D\big(\pi_{\theta}(\cdot \vert s) \Vert \pi_b(\cdot \vert s)\big)\right].$$

The second approach is **policy regularization**: here the regularization weight $\alpha$ is set to zero in the value objectives (V and Q) and is non-zero only in the policy objective. It is possible to instantiate, for example, the following batch RL algorithms in this framework:

- BEAR: policy regularization with a sample-based kernel MMD as $D$ and a min-max mixture of the two ensemble elements for $\bar{Q}$
- BCQ: no regularization, but policy optimization over a restricted action space

Extensive experiments over the four MuJoCo tasks Ant, HalfCheetah, Hopper and Walker show:

1. For a BEAR-like instantiation there is a modest advantage to keeping $\alpha$ fixed.
2. Using a mixture of a two- or four-element Q-network ensemble as the target value yields better returns than using a single Q-network.
3. Taking the minimum of the ensemble Q-functions is slightly better than taking a mixture (for Ant, HalfCheetah and Walker, but not for Hopper).
4. The value penalty yields higher returns than the policy penalty.
5. No choice of $D$ (MMD, KL (primal), KL (dual) or Wasserstein (dual)) significantly outperforms the others (note that this contradicts the BEAR paper, where MMD was better than KL).
6. The value-penalty version consistently outperforms BEAR, which in turn outperforms BCQ, which improves upon a partially trained baseline.

This large-scale study of different design choices helps in developing new methods. It is, however, surprising to see that most design choices in current methods are empirically shown to be non-crucial. This points to the importance of agreeing upon common test scenarios within a community, to prevent over-fitting new algorithms to a particular setting.
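A rough sketch of the value-penalty variant, assuming Gaussian policies and an analytic KL divergence for $D$ (one of several divergences BRAC supports); target networks, the $\bar{Q}$ ensemble, and the paper's divergence estimators are deliberately left out, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

obs_dim, act_dim, alpha, gamma = 4, 2, 0.1, 0.99

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy pi(. | s)."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, s):
        return Normal(self.mean(s), self.log_std.exp())

pi, pi_b = GaussianPolicy(), GaussianPolicy()   # learned policy / behavior policy cloned from B
Q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q_value(s, a):
    return Q(torch.cat([s, a], dim=-1)).squeeze(-1)

# One transition batch (s, a, r, s') sampled from the fixed replay buffer B.
s, a = torch.randn(32, obs_dim), torch.randn(32, act_dim)
r, s_next = torch.randn(32), torch.randn(32, obs_dim)

# Critic target with the value penalty alpha * D(pi(.|s') || pi_b(.|s')).
with torch.no_grad():
    a_next = pi(s_next).sample()
    penalty = kl_divergence(pi(s_next), pi_b(s_next)).sum(-1)
    target = r + gamma * (q_value(s_next, a_next) - alpha * penalty)
critic_loss = ((q_value(s, a) - target) ** 2).mean()

# Actor objective: maximize Q minus the same divergence penalty.
a_pi = pi(s).rsample()
actor_loss = (-q_value(s, a_pi) + alpha * kl_divergence(pi(s), pi_b(s)).sum(-1)).mean()
```

The policy-regularization variant would simply drop `penalty` from the critic target and keep the divergence term only in `actor_loss`.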
TLDR; The authors propose "Highway Networks", which use gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with the state of the art on the MNIST and CIFAR tasks. Gate outputs vary significantly with the input examples, demonstrating that the network does not just learn a "fixed structure", but dynamically routes data for specific examples.

Datasets used: MNIST, CIFAR-10, CIFAR-100

#### Key Takeaways

- Apply LSTM-like gating to network layers: a transform gate T and a carry gate C (see the sketch following this summary).
- The gating forces the layer inputs and outputs to be of the same size. Additional plain layers can be used for dimensionality transformations.
- Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc.) to initially force the network to pass information through and learn long-term dependencies.
- A highway network does not learn a fixed structure (the same gate outputs everywhere), but dynamic routing based on the current input.
- On complex data sets each layer makes an important contribution, which is shown by lesioning (setting to pass-through) individual layers.

#### Notes / Questions

- It seems the authors did not use dropout in their experiments. I wonder how these play together. Is dropout less effective for highway networks because the gates already learn efficient paths?
- If we see that certain gate outputs have low variance across examples, can we "prune" the network into a fixed structure to make it more efficient (for production deployments)?
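A minimal PyTorch sketch of a single highway layer (my own illustration, not the authors' code; `HighwayLayer` and the chosen dimensions are invented), showing the carry gate tied as $C = 1 - T$ and the negative gate-bias initialization:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, where H is a plain nonlinear transform
    and T is a sigmoid transform gate (the carry gate is tied to 1 - T here)."""
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        # Negative gate bias biases the layer toward carrying the input through,
        # so very deep stacks start out close to an identity mapping.
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1.0 - t) * x

# Input and output must have the same size; plain layers handle dimension changes.
layers = nn.Sequential(*[HighwayLayer(64) for _ in range(50)])
y = layers(torch.randn(8, 64))
```

With a bias around -2 the gates start mostly closed, so early in training each layer behaves nearly as an identity map; this is the property the paper credits for making very deep stacks trainable with plain SGD.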
This paper deals with the question of what and how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points to train on. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is no better than randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling (finite-sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of a classifier family $H$ should take the fraction of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is a totally random one) into account: if $p_c = 1$, then $m$ should be 0, too, as it is impossible to learn something meaningful from totally random labels (a toy sketch of this randomization test follows at the end of this summary).
* We seem to have built models which work well on image data in general, not only on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
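The label-randomization experiment is easy to reproduce in outline. Below is a toy sketch (my own, with a small MLP on synthetic data standing in for the Inception/AlexNet-on-ImageNet setup of the paper); `corrupt_labels` and all sizes are invented for illustration.

```python
import torch
import torch.nn as nn

def corrupt_labels(labels, num_classes, p_c, seed=0):
    """Replace a fraction p_c of the labels with uniformly random classes
    (p_c = 0: original labels, p_c = 1: completely random labels)."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(len(labels), generator=g) < p_c
    random_labels = torch.randint(0, num_classes, (len(labels),), generator=g)
    return torch.where(mask, random_labels, labels)

# Toy randomization test: a small MLP fit to completely random labels.
# The paper's point is that training error still reaches (close to) zero while
# test error stays at chance level, i.e. the model memorizes rather than generalizes.
x = torch.randn(512, 32)                      # fixed random "images"
y = corrupt_labels(torch.randint(0, 10, (512,)), num_classes=10, p_c=1.0)

model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

train_err = (model(x).argmax(dim=1) != y).float().mean().item()
print(f"training error on random labels: {train_err:.3f}")   # expected to approach 0
```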