ShortScience.org - Making Science Accessible!

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL

[link] Summary by CodyWild 5 years ago

The last two years have seen a number of improvements in the field of language model pretraining, and BERT (Bidirectional Encoder Representations from Transformers) is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which isn’t labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc.)? Mechanically, this works by either 1) training word embeddings and then using those embeddings as input feature representations for supervised models, or 2) treating the problem as a transfer learning problem and fine-tuning to a supervised task, similar to how you’d fine-tune a model trained on ImageNet by carrying over parameters and then training on your new task.

Even though the text we’re learning on is, strictly speaking, unsupervised (lacking a supervised label), we need to design a task on which we calculate gradients in order to train our representations. For unsupervised language modeling, that task is typically structured as predicting a word in a sequence given the prior words in that sequence. Intuitively, in order for a model to do a good job at predicting the word that comes next in a sentence, it needs to have learned patterns about language, on both grammatical and semantic levels. A notable recent change has been the shift from learning context-independent word vectors (where the word’s representation is the same globally) to contextualized ones, where the representation of the word depends on the sentence context it’s found in. All the baselines discussed here are of this second type.

The two main baselines that the BERT model compares itself to are OpenAI’s GPT and Peters et al.’s ELMo. The GPT model uses a self-attention-based Transformer architecture, going through each word in the sequence and predicting the next word by calculating an attention-weighted representation of all prior words. (For those who aren’t familiar, attention works by taking the dot product of a “query” vector with a vector for every word in a variable-length sequence, and then putting the outputs of those multiplications into a softmax operator, which inherently gets you a weighting scheme that adds to one.) ELMo uses models that gather context in both directions, but in a fairly simple way: it learns one deep LSTM that goes from left to right, predicting word k using words 0 through k-1, and a second LSTM that goes from right to left, predicting word k using words k+1 onward. The two directions are only combined shallowly (their hidden states are concatenated, and the layers are then mixed by a learned weighted sum) to get a representation for the word at position k.
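
To make the attention description concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: a single query, raw dot-product scores, and no learned projections or scaling, so it is not GPT's actual layer. The function name `attention` and the toy vectors are made up for illustration.

```python
import math

def attention(query, keys, values):
    """Softmax-weighted average of `values`, scored by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # non-negative and sums to one
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy example: three "words", each represented by a 2-d vector (assumed values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention([1.0, 0.0], keys, keys)
```

In a left-to-right model like GPT, `keys` and `values` would only contain the positions before the word being predicted.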

https://i.imgur.com/2329e3L.png

BERT differs from prior work in this area in several small ways, but one primary one: instead of representing a word using only information from words before it, or a shallow combination of prior information and subsequent information, it uses the full context from before and after the word in each of its multiple layers. It also uses an attention-based Transformer structure, but instead of incorporating just prior context, it pulls in information from the full sentence. To allow for a model that actually uses both directions of context at a time in its unsupervised prediction task, the authors of BERT slightly changed the nature of that task: they replace the word being predicted with a “mask” token, so that even with multiple layers of context aggregation on both sides, the model doesn’t have any way of knowing what the token is. By contrast, if the word weren’t masked, then after the first layer of context aggregation the representations of other words in the sequence would incorporate information about the predicted word k, making it trivial, if another layer were applied on top of that first one, for the model to directly access the value it’s trying to predict. This problem can be solved in one of three ways: by using multiple layers, each of which can only see prior context (like GPT); by learning fully separate L-R and R-L models and combining them at the final layer (like ELMo); or by masking tokens and predicting the value of the masked tokens using the full remainder of the context (like BERT).

This task design crucially allows for a multi-layered bidirectional architecture, and consequently a much richer representation of context in each word’s pre-trained representation. BERT demonstrates dramatic improvements over prior work when fine-tuned on a small amount of supervised data, suggesting that this change added substantial value.
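
As a rough illustration of the masked prediction task, the sketch below builds one training example by hiding some tokens and recording what the model would have to recover. The masking rate of 0.15, the `[MASK]` string, and the function name `mask_tokens` are assumptions for this example; the summary does not specify them, and any further details of the paper's masking procedure are left out.

```python
import random

MASK = "[MASK]"  # placeholder token string, assumed for illustration

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, targets) for one masked-prediction training example."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            targets.append(tok)   # the model must recover it from both sides
        else:
            inputs.append(tok)
            targets.append(None)  # no prediction loss at unmasked positions
    return inputs, targets

tokens = "the cat sat on the mat".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

Because the hidden token never appears in `inputs`, stacking several bidirectional layers cannot leak it back to the position being predicted.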

Mask R-CNN
He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

Mask RCNN picks up where Faster RCNN left off, with some augmentations aimed at improving instance segmentation (which was out of scope for FRCNN). Instance segmentation had previously been achieved remarkably well in *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN).

Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN. It adds a third branch for segmentation. The third branch takes the output from the RoIAlign layer and predicts a binary mask for each class.

##### Major Changes and intuitions

**Mask prediction**

The mask branch predicts a binary mask for each RoI using a fully convolutional network. The stark difference from common segmentation practice is the use of a per-pixel *sigmoid* activation for the final mask instead of a *softmax*, which means the masks of different classes don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is responsible for predicting the class, and only the mask belonging to the selected class is used when calculating Lmask.

Also, they show that a single class-agnostic mask prediction is almost as effective as a separate mask for each class, thereby supporting their method of decoupling classification from segmentation.
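
A small sketch of what "masks don't compete" means in practice: each class gets its own grid of logits, a sigmoid is applied per pixel, and the loss only looks at the grid picked out by a class index. This is an illustrative reading of the description above, not the authors' code; `mask_loss`, the nested-list layout, and the toy numbers are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_loss(mask_logits, target_mask, class_index):
    """Average per-pixel binary cross-entropy on one class's mask only.

    mask_logits: one 2-D grid of logits per class.
    target_mask: 2-D grid of 0/1 labels for the RoI.
    """
    logits = mask_logits[class_index]  # masks of other classes are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target_mask):
        for logit, target in zip(logit_row, target_row):
            p = sigmoid(logit)
            total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
            count += 1
    return total / count

# Two classes, a 1x2 mask: class 0 agrees with the target, class 1 does not.
mask_logits = [[[5.0, -5.0]], [[-5.0, 5.0]]]
target_mask = [[1, 0]]
loss_for_class_0 = mask_loss(mask_logits, target_mask, class_index=0)
```

With a softmax across classes, changing the logits of class 1 would change the probabilities of class 0; here it leaves `loss_for_class_0` untouched.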

**RoIAlign**

RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map. This quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). RoIAlign instead avoids any quantization of the RoI boundaries or bins: bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the results are aggregated (using max or average).
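
The following is a minimal sketch of the RoIAlign idea for a single bin on a single-channel feature map. It assumes coordinates are given in feature-map units with values located at integer positions, places the four samples on a regular 2x2 grid inside the bin, and averages them; the function names and the sampling positions are illustrative choices, not taken from the paper's implementation.

```python
import math

def bilinear(feature, y, x):
    """Interpolate a 2-D grid at the continuous location (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the grid
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_bin(feature, y_start, x_start, y_end, x_end):
    """Average four regularly spaced bilinear samples inside one RoI bin."""
    samples = []
    for fy in (0.25, 0.75):
        for fx in (0.25, 0.75):
            y = y_start + fy * (y_end - y_start)
            x = x_start + fx * (x_end - x_start)
            samples.append(bilinear(feature, y, x))
    return sum(samples) / len(samples)

feature = [[0.0, 1.0], [2.0, 3.0]]
pooled = roi_align_bin(feature, 0.0, 0.0, 1.0, 1.0)
```

Note that the bin boundaries stay floating-point throughout; nothing is rounded to the feature-map grid, which is the difference from RoIPool.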

**Backbone architecture**

Faster RCNN uses a VGG-like structure for extracting features from the image, the weights of which are shared between the RPN and the region detection layers. Here, the authors experiment with 2 backbone architectures: a ResNet used in the same single-scale fashion as in FRCNN, and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them to produce a pyramid of feature maps to be used for prediction, instead of a single-scale feature layer (Faster RCNN used the final output of the conv layers, before the fc layers).

**Training Objective**

The training objective looks like this:
![](https://i.imgur.com/snUq73Q.png)

Lmask is the term added on top of Faster RCNN's loss. How it is calculated is described above.

#### Observation

Mask RCNN performs significantly better than previous COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper.

A Deep Reinforced Model for Abstractive Summarization
Romain Paulus and Caiming Xiong and Richard Socher
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CL

[link] Summary by Udibr 6 years ago

Generates abstractive summaries from news articles. Also see [blog](https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization)
* Input:
  * vocab size 150K
  * start with $W_\text{emb}$ initialized from GloVe (100 dimensions)
* Seq2Seq:
  * bidirectional LSTM, `size=200` in each direction. The final hidden states are concatenated and fed as the initial hidden state of the decoder, an LSTM of `size=400`. Surprisingly, it's only one layer.
* Attention:
  * Add standard attention mechanism between each new hidden state of the decoder and all the hidden states of the encoder
  * A new kind of attention mechanism is done between the new hidden state of the decoder and all previous hidden states of the decoder
  * The new hidden state is concatenated with the two attention outputs and fed to dense+softmax to model the next word in the summary (output vocab size 50K). The weight matrix $W_h$ is reduced to $W_h = \tanh \left( W_\text{emb} W_\text{proj} \right) $, resulting in faster convergence, see [1](https://arxiv.org/abs/1611.01462) and [2](https://arxiv.org/abs/1608.05859)
* Pointer mechanism:
  * The concatenated values are also fed to a logistic classifier to decide whether the softmax output should be used or one of the words in the article should be copied to the output. The article word to be copied is selected using the same weights computed in the attention mechanism
* Loss
  * $L_\text{ml}$: NLL of the example summary $y^*$. If only $L_\text{ml}$ is used, then 25% of the time the generated token is used instead of the given sample as input to the next step.
  * $L_\text{rl}$: sample an entire summary $y^s$ from the model (temperature=1); the loss is the NLL of the sample multiplied by a reward. The reward is $r(y^s)-r(\hat{y})$, where $r$ is ROUGE-L and $\hat{y}$ is a greedily generated sequence
  * $L=\gamma L_\text{rl} + (1-\gamma)L_\text{ml}$ where $\gamma=0.9984$
* Training
  * `batch=50`, Adam, `LR=1e-4` for RL/ML+RL training
  * The training labels are the summary examples plus an indication of whether copying was used in the pointer mechanism and which word was copied. Copying is indicated when the summary word is OOV, or when it appears in the article and its NER tag is one of PERSON, LOCATION, ORGANIZATION or MISC
* Generation
  * 5 beams
  * force trigrams not to appear twice in the same beam
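
Two pieces of the list above are easy to sketch in a few lines of Python: the mixed objective from the Loss bullets and the repeated-trigram check from the Generation bullets. Both functions are illustrative; the names, the scalar inputs (NLL and ROUGE-L values would come from the model and a scorer that are not shown), and the idea of checking a whole candidate sequence at once are assumptions for this example.

```python
def mixed_loss(nll_sample, reward_sample, reward_greedy, nll_reference, gamma=0.9984):
    """L = gamma * L_rl + (1 - gamma) * L_ml, following the bullets above."""
    # L_rl: NLL of the sampled summary, scaled by its reward relative to greedy.
    loss_rl = (reward_sample - reward_greedy) * nll_sample
    # L_ml: NLL of the reference summary.
    loss_ml = nll_reference
    return gamma * loss_rl + (1.0 - gamma) * loss_ml

def has_repeated_trigram(tokens):
    """True if any three consecutive tokens occur more than once."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False

loss = mixed_loss(nll_sample=10.0, reward_sample=0.6, reward_greedy=0.4, nll_reference=20.0)
blocked = has_repeated_trigram("the cat sat on the cat sat".split())
```

During beam search, a candidate extension for which `has_repeated_trigram` returns true would simply be dropped from the beam. When the sampled summary scores higher than the greedy one, `loss_rl` is positive NLL scaled up, so minimizing it raises the sample's likelihood.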

Reverse Classification Accuracy: Predicting Segmentation Performance in the Absence of Ground Truth
Valindria, Vanya V. and Lavdas, Ioannis and Bai, Wenjia and Kamnitsas, Konstantinos and Aboagye, Eric O. and Rockall, Andrea G. and Rueckert, Daniel and Glocker, Ben
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

#### Idea
Reverse Classification Accuracy (RCA) models aim to answer the question of how to estimate the performance of models (the paper covers semantic segmentation models) in cases where ground truth is not available.

#### Why is it important

Before deployment, performance is quantified using different metrics, for which the predicted segmentation is compared to a reference segmentation, often obtained manually by an expert. But little is known about the real performance after deployment when a reference is unavailable. RCA aims to quantify the performance in those deployment scenarios.

#### Methodology
The RCA pipeline is fairly simple:

1. Train a model M on a training dataset T containing input images and ground truth {**I**,**G**}
2. Use M to predict a segmentation map S for a new input image I
3. Train an RCA model that uses the input image I to predict S. As this is a single datapoint, the model will overfit; there is no validation set for the RCA model
4. Test the RCA model on images that have ground truth G. Its best performance on them (DSC - Dice Similarity Coefficient) is an indicator of how well the original model M performed on the new image, whose ground truth is not available to compute segmentation accuracy (DSC) directly
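
The last step of the pipeline can be sketched as follows. This is a toy illustration only: masks are flat lists of 0/1 values, and `rca_predict` is a hypothetical stand-in for the RCA model that was trained on the single pair of new image and predicted segmentation; neither name comes from the paper.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks (flat 0/1 lists)."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    return 1.0 if size == 0 else 2.0 * intersection / size

def rca_estimate(rca_predict, reference_set):
    """Best DSC of the RCA model over reference images that have ground truth."""
    return max(dice(rca_predict(image), truth) for image, truth in reference_set)

# Toy reference set of (image, ground truth) pairs; the "model" just echoes its input.
reference_set = [([1, 1, 0], [1, 0, 0]), ([0, 1, 1], [0, 1, 1])]
estimate = rca_estimate(lambda image: image, reference_set)
```

The returned value is then read as a proxy for the DSC that the original model would have obtained on the new image.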

#### Observation
For validation of the RCA method, the predicted DSC and the real DSC were compared and the correlation between the two was calculated. Three segmentation methods and three slightly different RCA variants were used for comparison. The predicted DSC and real DSC were highly correlated in most cases.

Here's a snapshot of the results they obtained:

![](http://i.imgur.com/2ra0wQm.png)

SSD: Single Shot MultiBox Detector
Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott E. and Fu, Cheng-Yang and Berg, Alexander C.
European Conference on Computer Vision - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

SSD aims to solve the major problem with most of the current state-of-the-art object detectors, namely Faster RCNN and the like. These object detection algorithms share the same methodology:

- Train 2 different nets, a Region Proposal Net (RPN) and an advanced classifier, to detect the class of an object and its bounding box separately.
- During inference, run the test image at different scales to detect objects at multiple scales, to account for scale invariance.

This makes the nets extremely slow. Faster RCNN could operate at **7 FPS with 73.2% mAP** while SSD could achieve **59 FPS with 74.3% mAP** on the VOC 2007 dataset.

#### Methodology

SSD uses a single net to predict the object class and bounding box. However, it doesn't do that directly: it uses a mechanism for choosing ROIs and is trained end-to-end to predict the class and the boundary shift for each ROI.

##### ROI selection

Borrowing from Faster RCNN, SSD uses the concept of anchor boxes for generating ROIs from the feature maps of the last shared conv layer. For each pixel in a feature map, k default boxes with different aspect ratios are chosen around that pixel. So if a feature map has a resolution of m x n, that's *mnk* ROIs for a single feature layer. SSD uses multiple feature layers (with differing resolutions) for generating such ROIs, primarily to capture objects of different sizes. But because earlier layers in a deep conv net tend to capture low-level features, it only uses feature maps from a certain depth onwards.
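
The *mnk* count is easy to check with a small sketch that lays out default boxes on one feature map. The box parameterisation here (centres at pixel centres in normalised image coordinates, a single `scale` per layer, width and height derived from the aspect ratio) is an assumption for illustration; the summary does not spell these details out.

```python
import math

def default_boxes(m, n, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes, in [0, 1] image coordinates, for an m x n map."""
    boxes = []
    for row in range(m):
        for col in range(n):
            # Centre of this feature-map pixel in normalised image coordinates.
            cx, cy = (col + 0.5) / n, (row + 0.5) / m
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

# k = 3 aspect ratios on a 3 x 4 map gives m * n * k boxes.
boxes = default_boxes(3, 4, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
```

Repeating this for several feature layers with different `m`, `n` and `scale` values gives the full set of ROIs.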

##### ROI labelling
Any ROI that matches the Ground Truth for a class, after applying appropriate transforms, with a Jaccard overlap greater than 0.5 is positive. Given that the feature maps are at different resolutions and the boxes have different aspect ratios, doing that is not trivial. SSD uses simple scaling and aspect ratios to map the default boxes for each pixel at a given resolution to the appropriate ground truth dimensions before calculating the Jaccard overlap.
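
For reference, here is a minimal sketch of the Jaccard overlap (intersection over union) and the 0.5 labelling rule. It assumes both boxes are already expressed in the same coordinate frame as (xmin, ymin, xmax, ymax); the function names are made up for this example.

```python
def jaccard(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_positive(default_box, gt_box, threshold=0.5):
    """Label a default box positive if it overlaps the ground truth enough."""
    return jaccard(default_box, gt_box) > threshold

overlap = jaccard((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
```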

##### ROI classification

SSD uses a single convolution kernel with a 3x3 receptive field to predict, for each ROI, the 4 offsets (centre-x offset, centre-y offset, height offset, width offset) between the ROI and the Ground Truth box, along with class confidence scores for each class. So if there are c classes (including background), there are (c+4) filters for each ROI that a convolution kernel looks at.

In summary, we have convolution kernels that look at ROIs (which are default boxes around each pixel in a feature map layer) to generate (c+4) scores for each ROI. Multiple feature map layers with different resolutions are used for generating such ROIs. Some ROIs are positive and some negative, depending on their Jaccard overlap once the ground truth box has been scaled appropriately to account for the resolution difference between the input image and the feature map.

Here's how it looks:
![](https://i.imgur.com/HOhsPZh.png)

##### Training

For each ROI a combined loss is calculated as a combination of localisation error and classification error. The details are best explained in the figure. 
![](https://i.imgur.com/zEDuSgi.png)

##### Inference
For the ROI predictions, a small confidence threshold is first used to filter out irrelevant predictions. Non Maximum Suppression (NMS) with a Jaccard overlap of 0.45 per class is then applied to the remaining candidate ROIs, and the top 200 detections per image are kept.
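
The inference steps can be sketched as a greedy NMS routine for one class. This is a simplified illustration: the value 0.01 for the "small threshold" is an assumption, the top-200 cut is applied per call here rather than across all classes of an image, and the function names are made up for this example.

```python
def iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_threshold=0.01, iou_threshold=0.45, top_k=200):
    """Return indices of kept boxes for one class, best score first."""
    # Drop low-confidence predictions, then visit the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s > score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept[:top_k]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.005]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first too strongly and is suppressed, and the last box is removed by the score threshold before NMS runs.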

For further understanding of the intuitions regarding the paper and the results obtained please consider giving the full paper a read. 

The open sourced code is available at this [Github repo](https://github.com/weiliu89/caffe/tree/ssd)