ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Teaching Machines to Read and Comprehend
Hermann, Karl Moritz and Kociský, Tomás and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract from news sites more realistic question answering examples, by treating the main body of a news article as the content (the "facts") and extracting questions from the article's bullet point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question place older (e.g. "Producer X will not press charges"), they are able to generate queries which, while grammatically not being questions, do require to perform a form of question answering. Thanks to this procedure, two large *supervised* datasets are created, with several thousands of questions, based on the CNN and Daily Mail news sites.

Then, the authors investigate neural network based systems for solving this task. They consider a fairly simple Deep LSTM network, which is first fed the article's content and then the query. They also consider two architectures that incorporate an attentional mechanism, based on softmax weighting. The first ("Attentive Reader") attends once in the document (i.e. uses a single softmax weight vector) while the second ("Impatient Reader") attends after every word in the query (akin to the soft attention architecture in the "Show Attend and Tell" paper).

These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like.

Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.

#### My two cents

This is welcome development in the research on reasoning models based on neural networks. I've always thought it was unfortunate that the best benchmark available is based on synthetically generated cases. This work fixes this problem in a really clever way, while still being able to generate a large amount of training data. Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general statistics on words to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities.

Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future.

The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.

dx.doi.org
sci-hub
scholar.google.com

Training Deep and Recurrent Networks with Hessian-Free Optimization
Martens, James and Sutskever, Ilya
Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy
Keywords: dblp

[link] Summary by Raza Habib 8 years ago

## Very Short Summary
The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are:
* Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian.
* Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic.
* Use an efficient recursion to calculate the GGN.
* Initialise each round of conjugated gradients with the final vector of the previous iteration.
* A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold.
* Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective.

## Less Short Summary

### Hessian Free Optimisation in General
Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get:

$M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $

Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks.

There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse.

To run vanilla hessian free all you need to do at each iteration is:

1) Calculate the gradient vector using standard backprop.

2) Calculate $H\theta$ product using an efficient recursion.

3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$

The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary.

## Take aways
Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive.

More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit.

dx.doi.org
sci-hub
scholar.google.com

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra
Conference and Computer Vision and Pattern Recognition - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by nandini 8 years ago

# Object detection system overview.

https://i.imgur.com/vd2YUy3.png

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.
* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%.

## There is a 2 challenges faced in object detection
1. localization problem
2. labeling the data

1 localization problem :
* One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method.
* An alternative is to build a sliding-window detector. considered adopting a sliding-window approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm.

2 labeling the data:
* The conventional solution to this problem is to use unsupervised pre-training, followed by supervise fine-tuning
* supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL),
* fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs)

## Object detection with R-CNN
This system consists of three modules
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class specific linear SVMs.

Module design

1 Region proposals
* which detect mitotic cells by applying a CNN to regularly-spaced square crops.
* use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute).
* the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU)

2 Feature extraction.
* extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* warp all pixels in a tight bounding box around it to the required size
* The feature matrix is typically 2000x4096

3 Test time detection
* At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).
* warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
## Training

1 Supervised pre-training:
* pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data)

2 Domain-specific fine-tuning.
* use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001.

3 Object category classifiers.
* use intersection-over union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3.
* Once features are extracted and training labels are applied, we optimize one linear SVM per class.
* adopt the standard hard negative mining method to fit large training data in memory.

### Results on PASCAL VOC 201012

1 VOC 2010
* compared against four strong baselines including SegDPM, DPM, UVA, Regionlets.
* Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster
https://i.imgur.com/0dGX9b7.png
2 ILSVRC2013 detection.
* ran R-CNN on the 200-class ILSVRC2013 detection dataset
* R-CNN achieves a mAP of 31.4%
https://i.imgur.com/GFbULx3.png
#### Performance layer-by-layer, without fine-tuning
1 pool5 layer
* which is the max pooled output of the network’s fifth and final convolutional layer.
*The pool5 feature map is 6 x6 x 256 = 9216 dimensional
* each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input

2 Layer fc6
* fully connected to pool5
* it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases

3 Layer fc7
* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification
#### Performance layer-by-layer, with fine-tuning
* CNN’s parameters fine-tuned on PASCAL.
* fine-tuning increases mAP by 8.0 % points to 54.2%

### Network architectures
* 16-layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* RCNN with O-Net substantially outperforms R-CNN with TNet, increasing mAP from 58.5% to 66.0%
* drawback in terms of compute time, with in terms of compute time, with than T-Net.

1 The ILSVRC2013 detection dataset
* dataset is split into three sets: train (395,918), val (20,121), and test (40,152)

#### CNN features for segmentation.
* full R-CNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features
https://i.imgur.com/n1bhmKo.png

1 Comments

aclweb.org
scholar.google.com

Variational Neural Machine Translation
Zhang, Biao and Xiong, Deyi and Su, Jinsong and Duan, Hong and Zhang, Min
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 8 years ago

They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.

https://i.imgur.com/6yIEbDf.png

The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are optimised to be similar using Kullback-Leibler distance, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.

arxiv.org
scholar.google.com

Using Deep Learning for Segmentation and Counting within Microscopy Data
Hernández, Carlos X. and Sultan, Mohammad M. and Pande, Vijay S.
arXiv e-Print archive - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by Oleksandr Bailo 7 years ago

Given microscopy cell data, this work tries to determine the number of cells in the image. The whole pipeline is composed of two steps:
1. Cells segmentation:
    - Feature Pyramid Network is used for generating a foreground mask;
    - The last output of FPN is used for predicting mean foreground masks and aleatoric uncertainty masks. Each mask in both outputs is trained with aleatoric loss $ \frac{||y_{pred} - y_{gt} ||^2}{2\sigma} + \log{2\sigma}$  and [total-variational](https://en.wikipedia.org/wiki/Total_variation_denoising) loss. https://i.imgur.com/ssTuGVe.png
2. Cell counting:
    - VGG-11 network is used as a feature extractor from the predicted foreground segmentation masks. There are two output branches following VGG: cell count branch and estimated variance branch. Training is done using L2 loss function with aleatoric uncertainty for cell counts.
https://i.imgur.com/aijZn7e.png

While the idea to utilize neural networks to count cells in the image seems fascinating, the real benefit of such system in production is quite questionable. Specifically, why would you need to add a VGG-like feature extractor on top of already predicted cell segmentation masks, if you could simply do more work in segmentation network (i.e. separate cells better, predict objectness/contour) and get the number of cells directly from the predicted masks?