Welcome to ShortScience.org!
[link]
Mask RCNN picks up where Faster RCNN left off, with some additions aimed at instance segmentation (which was out of scope for FRCNN). Instance segmentation had been achieved remarkably well in *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN and adds a third branch for segmentation. The third branch takes the output of the RoIAlign layer and predicts a binary mask for each class.

##### Major changes and intuitions

**Mask prediction** The mask branch predicts a binary mask for each RoI using a small fully convolutional network. The stark difference is the use of a per-pixel *sigmoid* activation for the final masks instead of a *softmax*, which means the masks for different classes do not compete with each other. This *decouples* segmentation from classification: the class prediction branch decides the class, and only the mask predicted for the ground-truth class contributes to $L_{mask}$. The authors also show that a single class-agnostic mask prediction works almost as well as separate per-class masks, further supporting their decoupling of classification from segmentation.

**RoIAlign** RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). RoIAlign instead performs no quantization of the RoI boundaries or bins: bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the results are aggregated (using max or average); see the sketch at the end of this summary.

**Backbone architecture** Faster RCNN uses a VGG-like structure for extracting features from the image, with weights shared between the RPN and the region detection layers. Here the authors experiment with two backbones: a ResNet used in the same single-scale way as the VGG backbone in FRCNN, and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them into a pyramid of feature maps used for prediction, instead of a single-scale feature map (Faster RCNN used the final conv layer output before the fc layers).

**Training objective** The per-RoI training objective is $L = L_{cls} + L_{box} + L_{mask}$, where $L_{mask}$ is the addition over Faster RCNN; its computation was described above.

#### Observation

Mask RCNN performs significantly better than the previous COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper.
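To make the RoIPool vs. RoIAlign distinction concrete, here is a minimal NumPy sketch of pooling a single RoI bin with bilinear interpolation at four regularly spaced sample points. It is only an illustration of the idea: the function names, the 2x2 sampling grid, and the toy feature map are my own choices, not the paper's code.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate feature_map (H, W, C) at real-valued (y, x)."""
    H, W, _ = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    return (feature_map[y0, x0] * (1 - dy) * (1 - dx) +
            feature_map[y0, x1] * (1 - dy) * dx +
            feature_map[y1, x0] * dy * (1 - dx) +
            feature_map[y1, x1] * dy * dx)

def roi_align_bin(feature_map, y0, x0, y1, x1, agg=np.mean):
    """Aggregate four regularly spaced bilinear samples inside one RoI bin.

    No coordinate is ever rounded; RoIPool, by contrast, would quantize
    (y0, x0, y1, x1) to integer feature-map cells before pooling.
    """
    samples = []
    for fy in (0.25, 0.75):              # 2x2 regular sampling grid
        for fx in (0.25, 0.75):
            samples.append(bilinear_sample(feature_map,
                                           y0 + fy * (y1 - y0),
                                           x0 + fx * (x1 - x0)))
    return agg(np.stack(samples), axis=0)

# Example: one bin of an RoI with fractional coordinates on a toy feature map.
fmap = np.random.rand(50, 50, 256).astype(np.float32)
bin_feature = roi_align_bin(fmap, 10.3, 12.7, 11.8, 14.2)   # shape (256,)
```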
[link]
Szegedy et al. were (to the best of my knowledge) the first to describe the phenomenon of adversarial examples as researched today. Specifically, they formulated the main objective for obtaining adversarial examples as $\arg\min_r \|r\|_2$ s.t. $f(x+r)=l$ and $x+r$ being a valid image, where $f$ is the neural network and $l$ the target class (i.e. a targeted adversarial example). In the paper, they originally headlined the phenomenon as "blind spots in neural networks". While they give some explanation and provide experiments, also introducing the notion of transferability of adversarial examples and the idea of using adversarial examples as regularization during training, many questions are left open. Their conclusion, that these adversarial examples are of very low probability and yet lie dense within regular training examples, remains controversial in the literature.
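As a rough illustration of that objective, here is a small PyTorch sketch that relaxes the constrained problem into minimizing $c\|r\|_2^2 + \text{loss}(f(x+r), l)$ with plain gradient descent and clamping to keep a valid image. The original paper used box-constrained L-BFGS instead; `model`, `c`, and the step counts here are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def targeted_adversarial(model, x, target, c=0.1, steps=200, lr=0.01):
    """Search for a small r such that model(x + r) is classified as `target`.

    Relaxation of: argmin_r ||r||_2  s.t.  f(x + r) = l,  x + r a valid image.
    """
    r = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.SGD([r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        adv = (x + r).clamp(0.0, 1.0)                     # keep pixels in [0, 1]
        loss = c * r.pow(2).sum() + F.cross_entropy(model(adv), target)
        loss.backward()
        opt.step()
    return (x + r.detach()).clamp(0.0, 1.0)

# Usage (assuming an image classifier `model`, a batch of images `x` in [0, 1],
# and a tensor of target class indices `target`):
# x_adv = targeted_adversarial(model, x, target)
```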
[link]
#### Introduction

* Introduces fastText, a simple and highly efficient approach for text classification.
* On par with deep learning models in terms of accuracy while an order of magnitude faster to train and evaluate.
* [Link to the paper](http://arxiv.org/abs/1607.01759v3)
* [Link to code](https://github.com/facebookresearch/fastText)

#### Architecture

* Built on top of linear models with a rank constraint and a fast loss approximation.
* Starts with word representations that are averaged into a text representation, which is fed to a linear classifier (see the sketch at the end of this summary).
* The text representation can be thought of as a hidden state that is shared among features and classes.
* A softmax layer yields a probability distribution over the pre-defined classes.
* Computational complexity is $O(kh)$, where $k$ is the number of classes and $h$ is the dimension of the text representation.

##### Hierarchical Softmax

* Based on a Huffman coding tree.
* Reduces complexity to $O(h \log k)$.
* The top $T$ results (from the tree) can be computed efficiently in $O(\log T)$ using a binary heap.

##### N-gram Features

* Instead of modelling word order explicitly, uses a bag of n-grams to maintain efficiency without losing accuracy.
* Uses the [hashing trick](https://arxiv.org/pdf/0902.2206.pdf) for a fast and memory-efficient mapping of the n-grams.

#### Experiments

##### Sentiment Analysis

* fastText benefits from using bigrams.
* Outperforms [char-CNN](http://arxiv.org/abs/1502.01710v5) and [char-CRNN](http://arxiv.org/abs/1602.00367v1) and performs slightly worse than [VDCNN](http://arxiv.org/abs/1606.01781v1).
* Orders of magnitude faster in terms of training time.
* Note: fastText does not use pre-trained word embeddings.

##### Tag Prediction

* fastText with bigrams outperforms [Tagspace](http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf).
* fastText is up to 600 times faster at test time.
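The core classifier is simple enough to sketch in a few lines of NumPy. This shows only the flat-softmax path described under "Architecture"; the dimensions, names, and random parameters are illustrative, and the real library adds n-gram hashing, hierarchical softmax, and optimized training.

```python
import numpy as np

V, H, K = 30_000, 100, 4            # vocab size, text-repr dim h, classes k
E = np.random.randn(V, H) * 0.01    # word (and n-gram) embedding table
W = np.random.randn(H, K) * 0.01    # linear classifier over the text repr

def predict_proba(token_ids):
    """Average the token embeddings into one text representation, then softmax."""
    text_repr = E[token_ids].mean(axis=0)     # shared hidden state, shape (H,)
    logits = text_repr @ W                    # O(k*h) for the flat softmax
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

doc = np.array([12, 845, 7, 3021])            # toy token ids for one document
print(predict_proba(doc))                     # probability over the K classes
```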
[link]
The last two years have seen a number of improvements in the field of language model pretraining, and BERT - Bidirectional Encoder Representations from Transformers - is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which isn't labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc.)? Mechanically, this works either by 1) training word embeddings and then using those embeddings as input feature representations for supervised models, or 2) treating the problem as transfer learning and fine-tuning to a supervised task, similar to how you'd fine-tune a model trained on ImageNet by carrying over parameters and then training on your new task.

Even though the text we're learning on is, strictly speaking, unsupervised (lacking a supervised label), we need to design a task on which we calculate gradients in order to train our representations. For unsupervised language modeling, that task is typically structured as predicting a word in a sequence given the prior words in that sequence. Intuitively, for a model to do a good job of predicting the word that comes next in a sentence, it needs to have learned patterns about language at both grammatical and semantic levels. A notable recent change has been the shift from learning unconditional word vectors (where a word's representation is the same globally) to contextualized ones, where the representation of a word depends on the sentence context it is found in. All the baselines discussed here are of this second type.

The two main baselines the BERT model compares itself to are OpenAI's GPT and Peters et al.'s ELMo. The GPT model uses a self-attention-based Transformer architecture, going through each word in the sequence and predicting the next word by calculating an attention-weighted representation of all prior words. (For those who aren't familiar, attention works by multiplying a "query" vector with every word in a variable-length sequence and then putting the outputs of those multiplications through a softmax operator, which inherently gives you a weighting scheme that sums to one.) ELMo uses models that gather context in both directions, but in a fairly simple way: it learns one deep LSTM that goes from left to right, predicting word k using words 0 through k-1, and a second LSTM that goes from right to left, predicting word k using words k+1 onward. These two predictions are combined (literally: just summed together) to get a representation for the word at position k.

![](https://i.imgur.com/2329e3L.png)

BERT differs from prior work in this area in several small ways, but one primary one: instead of representing a word using only information from the words before it, or a simple sum of prior and subsequent information, it uses the full context from before and after the word in each of its multiple layers. It also uses an attention-based Transformer structure, but instead of incorporating just prior context, it pulls in information from the full sentence. To allow a model that actually uses both directions of context at a time in its unsupervised prediction task, the authors of BERT slightly changed the nature of that task: they replace the word being predicted with a "mask" token, so that even with multiple layers of context aggregation on both sides, the model doesn't have any way of knowing what the token is.
By contrast, if it weren't masked, then after the first layer of context aggregation the representations of the other words in the sequence would incorporate information about the predicted word k, making it trivial, once another layer were applied on top of that first one, for the model to directly access the value it is trying to predict. This problem can be solved either by using multiple layers, each of which can only see prior context (like GPT), by learning fully separate L-R and R-L models and combining them at the final layer (like ELMo), or by masking tokens and predicting the value of the masked tokens using the full remainder of the context. This task design crucially allows for a multi-layered bidirectional architecture, and consequently a much richer representation of context in each word's pre-trained representation. BERT demonstrates dramatic improvements over prior work when fine-tuned on a small amount of supervised data, suggesting that this change added substantial value.
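To make the masked-prediction task concrete, here is a minimal sketch of turning a sentence into a masked-LM training example. It is deliberately simplified: every selected position is replaced by the mask token, whereas BERT also sometimes substitutes a random token or leaves the word unchanged, and real inputs are wordpiece ids rather than strings.

```python
import random

MASK = "[MASK]"

def make_masked_lm_example(tokens, mask_rate=0.15):
    """Return (corrupted_tokens, targets) where targets[i] is the original
    token the model must predict at a masked position, and None elsewhere."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            corrupted[i] = MASK       # hide the token from every layer
            targets[i] = tok          # ...but keep it as the training target
    return corrupted, targets

corrupted, targets = make_masked_lm_example("the cat sat on the mat".split())
print(corrupted)   # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(targets)     # e.g. [None, None, 'sat', None, None, None]
```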
[link]
Meta learning, the idea of training models on some distribution of tasks in the hope that they can then learn more quickly on new tasks because they have "learned how to learn" similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there's an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc.), and have the outer loop, which updates some structural part of your model, be based on improving generalization error on each task within the distribution. It's been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are "in distribution" relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they're learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there's only the one on this paper) goes about this by studying simple learning tasks where it's easier for us to introspect into what each model is learning over time.

![](https://i.imgur.com/ceycq46.png)

In the first test, he looks at linear regression in a simple setting: for each individual "task", data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where factors, or "modes", with higher eigenvalues capture larger-scale patterns that explain more variance, and those with lower eigenvalues are smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training and compare, to see how close the network is to having correctly captured each of these different modes (see the sketch at the end of this summary). When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (higher eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, learn the large-scale and small-scale modes at the same rate. A similar analysis was performed, with similar results, for nonlinear regression, where instead of PCA-style components, the function generating the data was decomposed into different Fourier frequencies: the normal learner learned the broad, low-frequency patterns first, while the meta learner learned them all at the same rate.

The paper finds intuition for this by showing that the behavior of the meta learners matches quite well how a Bayes-optimal learner would update on new data points, in the world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on.
This is, at a high level, what I think we all sort of thought was happening with meta learning, but it's pretty neat to see it laid out in a small enough problem where we can actually validate against an analytic model. A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose the specific eigenvalue-focused metrics of progression on task learning that he did. They seem reasonable, but I'd have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel it wasn't really that analogous to the other tests.)
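As a rough reconstruction of the linear-regression analysis above (my own sketch, not the author's code: the two-layer linear learner, all dimensions, and the learning rate are arbitrary choices), the snippet below generates data from a known weight matrix, trains a learner from a raw initialization, and tracks how much of each singular mode of the true weights has been recovered over training.

```python
import numpy as np

rng = np.random.default_rng(0)

# One "task": data generated from a known true weight matrix plus noise.
d_in, d_hid, d_out, n = 20, 20, 10, 1000
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n, d_in))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n, d_out))

# Decompose the true weights into modes (the SVD / PCA-style factors above).
U, S, Vt = np.linalg.svd(W_true, full_matrices=False)

def mode_recovery(W_eff):
    """u_i^T W v_i / s_i per true mode: close to 1 once that mode is captured."""
    return np.array([U[:, i] @ W_eff @ Vt[i] for i in range(len(S))]) / S

# A small two-layer linear learner trained from scratch with gradient descent.
W1 = 0.01 * rng.normal(size=(d_hid, d_in))
W2 = 0.01 * rng.normal(size=(d_out, d_hid))
lr = 0.01
for step in range(1, 1501):
    err = X @ (W2 @ W1).T - Y                # prediction error, (n, d_out)
    gW2 = err.T @ (X @ W1.T) / n             # gradient w.r.t. W2
    gW1 = W2.T @ err.T @ X / n               # gradient w.r.t. W1
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 250 == 0:
        print(step, np.round(mode_recovery(W2 @ W1), 2))

# From a raw initialization, the large-singular-value modes approach 1 well
# before the small ones; the paper's point is that a meta-trained learner
# instead closes the gap on all modes at roughly the same rate.
```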