A Bayesian expresses beliefs about the next observation $x_{n+1}$, after observing $x_1, \dots, x_n$, through the **posterior predictive distribution** $p(x_{n+1} \vert x_1, \dots, x_n)$. Typically one invokes de Finetti's theorem and assumes there exists an underlying model $p(x \vert \theta)$, hence $p(x_{n+1} \vert x_1, \dots, x_n) = \int p(x_{n+1} \vert \theta) p(\theta \vert x_1, \dots, x_n) d\theta$; however, this integral is far from tractable in most cases. Nevertheless, a tractable posterior predictive is useful in settings like few-shot generative learning, where we observe only a few instances of a given class and are asked to produce more of it.

In this paper the authors take a slightly different approach and build a neural model with a tractable posterior predictive distribution $p(x_{n+1} \vert x_1, \dots, x_n)$ suited for complex objects like images. To do so, they take a simple model with a tractable posterior predictive $p(z_{n+1} \vert z_1, \dots, z_n)$ (like a Gaussian Process, but not quite) and use it over a latent code, which is obtained from observations using an analytically invertible encoder $f$. This setup lets you take a complex $x$ like an image and run it through $f$ to obtain $z = f(x)$ -- a simplified latent representation for which it is easier to build a joint density over all possible representations, and hence easier to model the posterior predictive. By feeding the latent representations of $x_1, \dots, x_n$ (namely $z_1, \dots, z_n$) into the posterior predictive $p(z_{n+1} \vert f(x_1), \dots, f(x_n))$ we obtain a distribution over latent representations coherent with those of the already observed $x$s. By sampling $z$ from this distribution and running it through $f^{-1}$ we recover an object in the observation space, $x_\text{pred} = f^{-1}(z)$ -- a sample coherent with the previous observations.

The important choices are:

* The model for the latent representations $z$: one could use a Gaussian Process, but the authors claim it lacks some helpful properties and go for the more general [Student-T Process](http://www.shortscience.org/paper?bibtexKey=journals/corr/1402.4306). They then assume that each component of $z$ is a univariate sample from this process (and hence independent of the other components).
* The encoder $f$: it has to be easily invertible and have an easy-to-evaluate Jacobian (the determinant of the Jacobian matrix). The former is needed to decode predictions from the latent representation space, and the latter is used to efficiently compute the density of observations $p(x_1, \dots, x_n)$ via the standard change of variables formula $$p(x_1, \dots, x_n) = p(z_1, \dots, z_n) \left\vert \text{det} \frac{\partial f(x)}{\partial x} \right\vert$$ The architecture of choice for this task is [RealNVP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1605.08803).

Done this way, it is possible to write out the marginal density $p(x_1, \dots, x_n)$ of all the observed $x$s and maximize it (as in Maximum Likelihood Estimation).
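To make the predict-then-decode pipeline concrete, here is a minimal numpy sketch. The elementwise affine encoder and the conjugate Gaussian latent model are stand-ins of my own, not the paper's RealNVP encoder or Student-t process:

```python
import numpy as np

# Toy stand-ins: an elementwise invertible encoder (not RealNVP) and a conjugate
# Gaussian latent model per component (not a Student-t process).
a, b = 1.5, 0.2                      # hypothetical encoder parameters
f = lambda x: a * x + b              # analytically invertible encoder
f_inv = lambda z: (z - b) / a

sigma2, tau2 = 0.1, 1.0              # per-component noise / prior variances

def predictive(z_obs):
    # posterior predictive p(z_{n+1} | z_1..z_n) for z_i | mu ~ N(mu, sigma2), mu ~ N(0, tau2)
    n = z_obs.shape[0]
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * z_obs.sum(axis=0) / sigma2
    return post_mean, post_var + sigma2

rng = np.random.default_rng(0)
x_obs = rng.normal(size=(5, 3))            # 5 observed "images" with 3 pixels each
z_obs = f(x_obs)                           # latent representations of the observations

mean, var = predictive(z_obs)
z_next = rng.normal(mean, np.sqrt(var))    # sample a coherent latent representation
x_pred = f_inv(z_next)                     # decode it back to observation space
print(x_pred)
```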
The authors choose to factor the joint density in an auto-regressive fashion (via the chain rule) $$p(x_1, \dots, x_n) = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_1, x_2) \dots p(x_n \vert x_1, \dots, x_{n-1})$$ with all the conditional marginals $p(x_i \vert x_1, \dots, x_{i-1})$ having an analytic density (a Student-t density times the Jacobian term). This allows one to form a fully differentiable recurrent computation graph whose parameters (the parameters of the Student-t processes for each component of $z$, plus the parameters of the encoder $f$) can be learned using any stochastic gradient method.

https://i.imgur.com/yRrRaMs.png
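A similarly toy sketch of the autoregressive likelihood that gets maximized: each step contributes an analytic conditional plus the Jacobian correction (again a Gaussian stand-in for the Student-t process and a 1-D affine stand-in for the encoder, not the paper's model):

```python
import numpy as np

a, b = 1.5, 0.2                      # hypothetical encoder parameters
sigma2, tau2 = 0.1, 1.0              # per-component noise / prior variances

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def joint_log_density(x_seq):
    log_p, n, s = 0.0, 0, 0.0
    for x in x_seq:
        z = a * x + b                                   # encode
        post_var = 1.0 / (1.0 / tau2 + n / sigma2)      # posterior over the latent mean
        post_mean = post_var * s / sigma2
        log_p += log_normal(z, post_mean, post_var + sigma2)  # log p(z_i | z_1..z_{i-1})
        log_p += np.log(abs(a))                         # log |df/dx|, change of variables
        n, s = n + 1, s + z
    return log_p

print(joint_log_density([0.4, 0.1, -0.2, 0.5]))  # = sum of conditional log-densities
```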
---
This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on its test-set performance. The conceptual advance proposed by this paper is to adopt the broad strokes of the meta learning framework, but apply it to unsupervised data, i.e. data with no pre-defined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations - specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta learning techniques we need that aforementioned distribution of tasks, and we've defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels?

https://i.imgur.com/YaU1y1k.png

The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off-the-shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task (see the short sketch at the end of this summary). It manages to get multiple different tasks, rather than just one - remember, the premise of meta learning is models learned over multiple tasks - by randomly up- and down-scaling dimensions of the embedding before clustering is applied. Different scalings of the dimensions put different points close to one another, which produces different partitions of the dataset into clusters. With this distribution of "supervised" tasks in hand, the paper simply applies previously proposed meta learning techniques - like MAML, which learns a model that can be quickly fine-tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions, are close to one another.

https://i.imgur.com/BRcg6n7.png

An interesting note from the evaluation is that this method - somewhat amusingly dubbed "CACTUs" - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from that trade-off when a genuinely dissimilar class split is seen during testing.

One other quick thing I'd like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks.
First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on are close together in some idealized conceptual space. While not a perfect assumption (there's a reason we don't use KNN over embeddings for all of our ML tasks), it does have a general reasonableness behind it, since the kinds of classes we care about are rarely very conceptually heterogeneous. Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself sub-optimal as a basis for supervised tasks, is a well-enough designed feature space for the general heuristic of "nearby things are likely of the same class" to at least approximately hold.

I find this set of assumptions interesting because they are so simplifying that it's a bit of a surprise that they actually work: even if the "classes" we meta-train our model on are defined by simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of "separating real-world, messier-in-embedding-space classes using little data".
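As promised above, here is a minimal sketch of the task-construction step. The random embeddings, scaling range, and use of k-means are my own stand-ins rather than the paper's exact recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

# Start from unsupervised embeddings, rescale dimensions at random, cluster, and use
# the cluster IDs as labels for a faux-supervised task.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))   # stand-in for learned unsupervised embeddings

def make_task(embeddings, n_classes=5, rng=rng):
    # random per-dimension scaling changes which points end up near one another
    scales = rng.uniform(0.1, 10.0, size=embeddings.shape[1])
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(
        embeddings * scales)
    return labels

# a distribution of "supervised" tasks over the same unlabeled data
tasks = [make_task(embeddings) for _ in range(20)]
```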
---
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that the distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** it would be interesting to provide further results for different datasets / domain shifts, to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR-10, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is weaker in the `PixelCNN` case.

Intuitively, this could come from the fact that both datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and the result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon in the case of invertible flow models, as these provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (invertible function) $f(x; \phi)$ between the input space $\mathcal X$ and the latent space $\mathcal Z$, and by a choice of latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows:

$$ p_x(x) = p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| $$

And the training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is built such that computing the log-determinant of the Jacobian in the previous equation can be done efficiently.
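As a toy illustration of the two terms in this objective (with an elementwise affine map of my own standing in for Glow), the per-sample log-likelihood splits into a density term and a volume term:

```python
import numpy as np

# "Density" term log p_z(f(x)) vs. "volume" term log|det df/dx| for a toy flow:
# an elementwise affine map with fixed parameters, a stand-in rather than Glow.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=8)   # hypothetical per-dimension scales
b = rng.normal(size=8)              # hypothetical per-dimension shifts

def log_px_terms(x):
    z = a * x + b
    density_term = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))  # standard-normal p_z
    volume_term = np.sum(np.log(np.abs(a)))                   # log|det Jacobian|
    return density_term, volume_term

x = rng.normal(size=8)
d, v = log_px_terms(x)
print(f"density term: {d:.2f}, volume term: {v:.2f}, log p(x): {d + v:.2f}")
```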
First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experiments with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***.

Secondly, they try to directly analyze the difference in likelihood between the two domains $\mathcal X$ and $\tilde{\mathcal X}$; this can be done via a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR-10's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., NeurIPS 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, NeurIPS 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016
---
Gowal et al. propose interval bound propagation to obtain certified robustness against adversarial examples. In particular, given a neural network consisting of linear layers and monotonically increasing activation functions, a set of allowed perturbations is propagated to obtain upper and lower bounds at each layer. These lead to bounds on the logits of the network, which are used to verify whether the network changes its prediction on the allowed perturbations.

Specifically, Gowal et al. consider an $L_\infty$ ball around input examples; the initial bounds are, thus, $\underline{z}_0 = x - \epsilon$ and $\overline{z}_0 = x + \epsilon$. For each layer, the lower bound is defined as

$$\underline{z}_{k,i} = \min_{\underline{z}_{k-1} \leq z_{k-1} \leq \overline{z}_{k-1}} e_i^T h_k(z_{k-1})$$

with the analogous maximization problem for the upper bound; here, $h_k$ denotes the applied layer. For linear layers and monotonic activation functions, this is easy to solve, as shown in the paper. Moreover, computing these bounds is very efficient, requiring only roughly twice the computation of one forward pass.

During training, a combination of a clean loss and an adversarial loss is used:

$$\kappa l(z_K, y) + (1 - \kappa) l(\hat{z}_K, y)$$

where $z_K$ are the logits of the input $x$, and $\hat{z}_K$ are the adversarial logits computed as

$$\hat{z}_{K,y'} = \begin{cases} \overline{z}_{K,y'} & \text{if } y' \neq y\\ \underline{z}_{K,y} & \text{otherwise}\end{cases}$$

Both $\epsilon$ and $\kappa$ are annealed during training. In experiments, it is shown that this method results in quite tight bounds on robustness.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
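As an aside, a minimal numpy sketch of the propagation rule for one affine layer followed by a monotonic activation (toy weights of my own, not the paper's implementation):

```python
import numpy as np

def linear_bounds(lower, upper, W, b):
    # propagate an axis-aligned box through z -> W z + b
    mu = (upper + lower) / 2.0          # centre of the box
    r = (upper - lower) / 2.0           # radius of the box
    mu_out = W @ mu + b
    r_out = np.abs(W) @ r
    return mu_out - r_out, mu_out + r_out

def relu_bounds(lower, upper):
    # monotonic activations can simply be applied to both bounds
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

x = np.array([0.5, -0.2])
eps = 0.1
lower, upper = x - eps, x + eps          # initial L_inf ball around the input

W = np.array([[1.0, -2.0], [0.5, 0.3]])  # hypothetical layer weights
b = np.array([0.1, -0.1])
lower, upper = relu_bounds(*linear_bounds(lower, upper, W, b))
print(lower, upper)                      # elementwise bounds on the activations
```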
---
Meta learning, or the idea of training models on some distribution of tasks in the hope that they can then learn more quickly on new tasks because they have "learned how to learn" similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there's an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc.), and have the outer loop, which updates some structural part of your model, be based on improving generalization error on each task within the distribution. It's been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are "in distribution" relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they're learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there's only the one on this paper) goes about this by studying simple learning tasks where it's easier for us to introspect into what each model is learning over time.

https://i.imgur.com/ceycq46.png

In the first test, he looks at linear regression in a simple setting: for each individual "task", data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where factors, or "modes", with higher eigenvalues represent larger-scale patterns that explain more variance, and lower eigenvalues are smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (higher-eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, also learn large-scale and small-scale modes at the same rate. Similar analysis was performed, and similar results found, for nonlinear regression, where instead of PCA-style components the function generating the data was decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, whereas the meta learner learned them all at the same rate.

The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in a world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on.
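To make the mode-tracking analysis concrete, here is a toy version of it (my own sketch, not the author's code):

```python
import numpy as np

# Decompose the true data-generating weight matrix with an SVD and measure how well
# a learned weight matrix captures each mode; large-scale modes come first in s.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(10, 10))
U, s, Vt = np.linalg.svd(W_true)

def mode_recovery(W_learned):
    # project the learned weights onto each true mode and compare to the true eigenvalue
    return np.array([U[:, i] @ W_learned @ Vt[i, :] for i in range(len(s))]) / s

# e.g. a partially trained learner that has only picked up the top 3 modes
W_partial = (U[:, :3] * s[:3]) @ Vt[:3, :]
print(np.round(mode_recovery(W_partial), 2))  # ~1 for the top 3 modes, ~0 elsewhere
```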
This is, at a high level, what I think we all sort of thought was happening with meta learning, but it's pretty neat to see it laid out in a small enough problem where we can actually validate it against an analytic model.

A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose the specific eigenvalue-focused metrics of task-learning progression that he did. They seem reasonable, but I'd have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress.

(A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn't really that analogous to the other tests.)