The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image at a time. The authors report that this approach produces generated images that are hard to distinguish from the training data by eye.
#### What is DRAW:
The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region written by the decoder.
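The iterative generation loop can be sketched as follows. This is a minimal numpy sketch with toy sizes; `decoder_step` is a placeholder for the recurrent decoder plus attentive writer, which the paper implements with LSTMs and a learned attention window.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_generate(decoder_step, T, canvas_shape, z_dim, hidden_dim, rng):
    # c_0 = 0: the canvas starts blank and is refined over T steps.
    canvas = np.zeros(canvas_shape)
    h = np.zeros(hidden_dim)                # recurrent decoder state
    for _ in range(T):
        z = rng.standard_normal(z_dim)      # z_t ~ N(0, I) at generation time
        patch, h = decoder_step(z, h)       # attentive "write" for this step
        canvas = canvas + patch             # c_t = c_{t-1} + write(h_t)
    return sigmoid(canvas)                  # final image: sigma(c_T)

# Toy usage: a "decoder" that writes a constant patch each step.
rng = np.random.default_rng(0)
def toy_step(z, h):
    return np.full((4, 4), 0.1), h

img = draw_generate(toy_step, T=5, canvas_shape=(4, 4),
                    z_dim=2, hidden_dim=3, rng=rng)
```

The key point the sketch captures is that each step only *adds* to the canvas, so early steps can lay down a rough image that later steps refine locally.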
#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.
#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since the attention mechanism already restricts the input of the network.
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.
* I think a better exposition of the attention mechanism would improve this paper.
This paper models object detection as a regression problem for bounding
boxes and object class probabilities with a single pass through the CNN. The
main contribution is the idea of dividing the image into a 7x7 grid, and having
each cell predict a distribution over class labels as well as a bounding box
for the object whose center falls into it. It's much faster than R-CNN and
Fast R-CNN, as the additional step of extracting region proposals has been removed.
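The per-cell decoding described above can be sketched as follows. This is a simplified numpy sketch assuming one box per cell; the actual model predicts B=2 boxes per cell, uses 20 VOC classes, and applies non-max suppression afterwards.

```python
import numpy as np

def decode_yolo_grid(pred, conf_thresh=0.5, S=7, img_size=448):
    """Decode a simplified YOLO-style output tensor.

    pred: array of shape (S, S, 5 + C), where the last axis holds
    (x, y, w, h, confidence, class scores...). x, y are offsets within
    the cell; w, h are relative to the whole image.
    """
    boxes = []
    cell = img_size / S
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = pred[row, col, :5]
            if conf < conf_thresh:
                continue
            cx = (col + x) * cell            # box center in pixels
            cy = (row + y) * cell
            cls = int(np.argmax(pred[row, col, 5:]))
            boxes.append((cx, cy, w * img_size, h * img_size, conf, cls))
    return boxes

# Toy usage: one confident detection centered in cell (3, 3).
pred = np.zeros((7, 7, 25))
pred[3, 3, :5] = [0.5, 0.5, 0.1, 0.2, 0.9]
pred[3, 3, 5 + 11] = 1.0
boxes = decode_yolo_grid(pred)
```

Because every cell is decoded in one pass over a single tensor, detection becomes a single forward propagation through the CNN, which is where the speed advantage comes from.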
## Strengths
- Works in real time: the base model runs at 45 fps and a faster version at up to
150 fps, which the authors claim is more than twice as fast as other real-time
detection works.
- End-to-end model: localization and classification errors can be jointly optimized.
- YOLO makes more localization errors but fewer background mistakes than
Fast R-CNN, so using YOLO to eliminate false background detections from
Fast R-CNN results in a ~3% mAP gain (with little added computation, since
Fast R-CNN is much slower than YOLO anyway).
## Weaknesses / Notes
- Results fall short of state-of-the-art: 57.9% vs. 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid
cell can be detected.
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind the remaining weights to their values from early in training and retrain, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small, effective subnetwork (a "winning ticket") that you can find if you search enough.
Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:
1. Weight rewinding, where you rewind weights to *near* their starting values and then retrain using the learning rates from early in training, outperforms fine-tuning from the values the weights had when you pruned
2. Learning rate rewinding, where you keep the weights as they are but rewind the learning rate to what it was early in training, is actually the most effective for a given amount of training time/search cost
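The two compared schemes can be sketched as follows. This is a toy numpy sketch: `grads_fn` and the schedules stand in for the real network's gradients and the paper's actual learning-rate schedules, and pruning here is simple global magnitude pruning.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    # Standard practice: prune the lowest-magnitude fraction of weights.
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def retrain(weights, mask, grads_fn, lr_schedule):
    # Fine-tuning passes the final (small) learning rate here; learning-rate
    # rewinding passes the early, larger rates. Only lr_schedule differs.
    w = weights * mask
    for lr in lr_schedule:
        w = w - lr * grads_fn(w) * mask   # pruned weights stay at zero
    return w

# Toy usage: prune half the weights, then take two masked SGD steps
# on the toy loss 0.5 * ||w||^2 (whose gradient is w itself).
w = np.array([0.1, -0.2, 0.3, -0.4])
mask = magnitude_prune_mask(w, 0.5)
out = retrain(w, mask, lambda w: w, lr_schedule=[0.1, 0.1])
```

The point of the sketch is that weight rewinding changes the starting `weights` while learning-rate rewinding changes only `lr_schedule`; the paper's finding is that the latter knob is the one that matters most.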
To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed weights. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates.
All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
The authors have a dataset of 780 electronic health records and use it to detect various medical events, such as adverse drug events and drug dosages. The task is framed as assigning a label to each word in the document.
*(Table: annotation statistics for the corpus of health records.)*
They compare CRFs, LSTMs, and GRUs. Both LSTMs and GRUs outperform the CRF, but the best performance is achieved by a GRU trained on whole documents.
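As a concrete illustration of the word-level labeling setup, here is a toy example in the common BIO scheme. The sentence and the label names are invented for illustration; they are not taken from the paper's annotation schema.

```python
# Hypothetical sentence with per-word labels (B- begins an entity,
# I- continues it, O marks words outside any entity).
tokens = ["Patient", "developed", "rash", "after", "10", "mg", "warfarin"]
labels = ["O", "O", "B-ADE", "O", "B-DOSAGE", "I-DOSAGE", "B-DRUG"]

def spans(tokens, labels):
    """Group per-word BIO labels back into (entity_type, text) spans."""
    out, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                out.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)
        else:
            if current:
                out.append(current)
            current = None
    if current:
        out.append(current)
    return [(t, " ".join(ws)) for t, ws in out]
```

Framing event detection this way is what lets sequence models like CRFs, LSTMs, and GRUs all be applied to the same data: each simply predicts one label per word.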
The method is a multi-task learning model performing person detection, keypoint detection, person segmentation, and pose estimation. It is a bottom-up approach, as it first localizes identity-free semantics and then groups them into person instances.
- **Backbone**. The feature extractor is a ResNet (50 or 101) with one [Feature Pyramid Network](https://arxiv.org/pdf/1612.03144.pdf) (FPN) for the keypoint branch and one for the person detection branch. The FPN enhances the extracted features through a multi-level representation.
- **Keypoint detection** detects keypoints and also produces a pixel-level segmentation mask.
FPN features $K_i$ are processed with multiple $3\times3$ convolutions followed by concatenation and a final $1\times1$ convolution to obtain predictions for each keypoint as well as the segmentation mask (see the paper's figure for details). This results in (number of keypoints per person in the dataset) + 1 output layers. Additionally, intermediate supervision (i.e., an auxiliary loss) is applied at the FPN outputs. An $L_2$ loss between the predictions and Gaussian peaks at the ground-truth keypoint locations is used; similarly, an $L_2$ loss is applied between the segmentation predictions and the corresponding ground-truth masks.
- **Person detection** is essentially a [RetinaNet](https://arxiv.org/pdf/1708.02002.pdf), a one-stage object detector, modified to only handle *person* class.
- **Pose estimation**. Given the initial keypoint predictions, the Pose Residual Network (PRN) selects, for each keypoint class, the single keypoint belonging to the person instance.
During inference, the PRN takes cropped outputs from the keypoint detection branch, defined by the predicted bounding boxes from the person detection branch, resizes them to a fixed size, and forwards them through a multilayer perceptron with a residual connection. During training, the same process is performed, except that the cropped keypoints come from the ground-truth annotations defined by labeled bounding boxes.
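Two pieces of the pipeline above can be sketched in numpy: the Gaussian-peak targets used for the $L_2$ keypoint loss, and the crop-and-resize step that prepares the PRN's input. The sigma, the output grid size, and the nearest-neighbor resize here are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def keypoint_target(h, w, cx, cy, sigma=2.0):
    # Ground-truth heatmap: a Gaussian peak centered on the keypoint,
    # regressed with an L2 loss against the predicted heatmap.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def crop_and_resize(heatmaps, box, out_size):
    # Crop per-keypoint heatmaps to a detected person box and resize to a
    # fixed grid before feeding the PRN (nearest-neighbor for simplicity).
    # heatmaps: (K, H, W); box: (x0, y0, x1, y1) in pixel coordinates.
    x0, y0, x1, y1 = box
    crop = heatmaps[:, y0:y1, x0:x1]
    _, h, w = crop.shape
    oy, ox = out_size
    rows = np.arange(oy) * h // oy
    cols = np.arange(ox) * w // ox
    return crop[:, rows][:, :, cols]

# Toy usage: one keypoint target, and one crop of 17 heatmaps to a box.
t = keypoint_target(16, 16, cx=5, cy=8)
hm = np.zeros((17, 64, 64))
hm[0, 20, 20] = 1.0
out = crop_and_resize(hm, (10, 10, 42, 42), out_size=(16, 16))
```

At training time the crop would come from a labeled box and at inference time from the person detection branch, but the transformation itself is the same in both cases.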
This is not an end-to-end trainable model: while the keypoint and person detection branches can, in theory, be trained simultaneously, the PRN requires separate training.
**Personal note**. Interestingly, PRN training with ground-truth inputs (i.e., "perfect" inputs) only reaches an 89.4 mAP validation score, which is surprisingly far from the maximum possible score. This presumably means that even if the preceding networks or branches performed perfectly, the PRN could become a performance bottleneck. Therefore, more effort should be directed at the PRN itself. Moreover, modifying the network to support end-to-end training might help boost performance.
Open-source implementations used to verify my understanding of the paper: [link1](https://github.com/LiMeng95/MultiPoseNet.pytorch), [link2](https://github.com/IcewineChen/pytorch-MultiPoseNet).