Query-Regression Networks for Machine Comprehension
Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi
arXiv e-Print archive, 2016
Keywords: cs.CL, cs.NE
First published: 2016/06/14
Abstract: We present Query-Regression Network (QRN), a variant of Recurrent Neural
Network (RNN) that is suitable for end-to-end machine comprehension. While
previous work largely relied on external memory and global softmax attention
mechanism, QRN is a single recurrent unit with internal memory and local
sigmoid attention. Unlike most RNN-based models, QRN is able to effectively
handle long-term dependencies and is highly parallelizable. In our experiments
we show that QRN obtains the state-of-the-art result in end-to-end bAbI QA
tasks.
#### Introduction
* **Machine Comprehension (MC)** - given natural language sentences (a story or passage) and a natural language question, produce the answer.
* **End-To-End MC** - cannot use external language resources like dependency parsers; the only supervision during training is the correct answer.
* **Query Regression Network (QRN)** - Variant of Recurrent Neural Network (RNN).
* [Link to the paper](http://arxiv.org/abs/1606.04582)
#### Related Work
* Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices for modeling sequential data but perform poorly on end-to-end MC because of long-term dependencies.
* Attention Models with shared external memory focus on single sentences in each layer but the models tend to be insensitive to the time step of the sentence being accessed.
* **Memory Networks (and MemN2N)**
* Add time-dependent variable to the sentence representation.
* Summarize the memory in each layer to control attention in the next layer.
* **Dynamic Memory Networks (and DMN+)**
* Combine RNN and attention mechanism to incorporate time dependency.
* Uses two GRUs:
* time-axis GRU - summarizes the memory in each layer.
* layer-axis GRU - controls the attention in each layer.
* QRN is a much simpler model without any memory summarized node.
#### QRN
* Single recurrent unit that updates its internal state through time and layers.
* Inputs
* $q_{t}$ - local query vector
* $x_{t}$ - sentence vector
* Outputs
* $h_{t}$ - regressed query vector
* $x_{t}$ - sentence vector without any modifications
* Equations
* $z_{t} = \alpha(x_{t}, q_{t})$
* $\alpha$ is the **update gate function**, measuring the relevance between the input sentence and the local query.
* $h'_{t} = \gamma(x_{t}, q_{t})$
* $\gamma$ is the **regression function**, transforming the local query into the regressed query.
* $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$
* To create a multi-layer model, the output of the current layer becomes the input to the next layer (a minimal sketch follows).
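To make the recurrence concrete, here is a minimal NumPy sketch of a single unidirectional, scalar-gate QRN layer. Only the recurrence $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$ is taken from the paper; the concrete parameterizations of $\alpha$ (a scalar sigmoid over the elementwise product of sentence and query) and $\gamma$ (a tanh over their concatenation) are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrn_layer(X, Q, Wz, bz, Wh, bh):
    """One unidirectional, scalar-gate QRN layer.

    X  : (T, d) sentence vectors x_t
    Q  : (T, d) local query vectors q_t (for the first layer, every row is the question)
    Wz : (d,), bz : scalar   -- assumed form of the update gate alpha
    Wh : (d, 2d), bh : (d,)  -- assumed form of the regression function gamma
    Returns H : (T, d) with H[t] = h_t; the sentences X pass through unchanged.
    """
    T, d = X.shape
    h = np.zeros(d)                                             # h_0
    H = np.zeros((T, d))
    for t in range(T):
        x_t, q_t = X[t], Q[t]
        z_t = sigmoid(Wz @ (x_t * q_t) + bz)                    # update gate z_t (scalar)
        h_cand = np.tanh(Wh @ np.concatenate([x_t, q_t]) + bh)  # regressed query h'_t
        h = z_t * h_cand + (1.0 - z_t) * h                      # h_t = z_t h'_t + (1 - z_t) h_{t-1}
        H[t] = h
    return H
```

Stacking layers then amounts to feeding the regressed queries of layer $k$ back in as the local queries of layer $k+1$, e.g. `H2 = qrn_layer(X, qrn_layer(X, Q, Wz, bz, Wh, bh), Wz2, bz2, Wh2, bh2)`.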
#### Variants
* **Reset gate function** ($r_{t}$) to reset or nullify the regressed query $h'_{t}$ (inspired by GRU).
* The new equation becomes $h_{t} = z_{t} \cdot r_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$
* **Vector gates** - update and reset gate functions can produce vectors instead of scalar values (for finer control).
* **Bidirectional** - QRN can look at both past and future sentences while regressing the queries.
* $q_{t}^{k+1} = h_{t}^{k, \text{forward}} + h_{t}^{k, \text{backward}}$.
* The parameters of the update and regression functions are shared between the two directions (see the sketch after this list).
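Continuing the sketch above (same illustrative parameterizations, reusing `sigmoid` and NumPy), the reset-gate and bidirectional variants look roughly like this; for the vector-gate variant, `Wz`/`Wr` would produce $d$-dimensional gates instead of scalars.

```python
def qrn_layer_reset(X, Q, Wz, bz, Wr, br, Wh, bh):
    """QRN layer with scalar update and reset gates; r_t can nullify the candidate h'_t."""
    T, d = X.shape
    h = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):
        x_t, q_t = X[t], Q[t]
        z_t = sigmoid(Wz @ (x_t * q_t) + bz)                     # update gate z_t
        r_t = sigmoid(Wr @ (x_t * q_t) + br)                     # reset gate r_t (GRU-inspired, assumed form)
        h_cand = np.tanh(Wh @ np.concatenate([x_t, q_t]) + bh)   # regressed query h'_t
        h = z_t * r_t * h_cand + (1.0 - z_t) * h                 # h_t = z_t r_t h'_t + (1 - z_t) h_{t-1}
        H[t] = h
    return H

def next_layer_queries(X, Q, params):
    """Bidirectional combination: q_t^{k+1} = h_t^{forward} + h_t^{backward}.
    The same parameter tuple is used in both directions (weights are shared)."""
    H_fwd = qrn_layer_reset(X, Q, *params)
    H_bwd = qrn_layer_reset(X[::-1], Q[::-1], *params)[::-1]
    return H_fwd + H_bwd
```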
#### Parallelization
* Unlike most RNN based models, recurrent updates in QRN can be computed in parallel across time.
* For the exact equations, refer to the [paper](http://arxiv.org/abs/1606.04582); a rough illustration of the idea follows.
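The key observation is that the recurrence $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$ is linear in $h$, so it unrolls to $h_{t} = \sum_{i \le t} z_{i} h'_{i} \prod_{i < j \le t} (1 - z_{j})$, which has no sequential dependence once the gates and candidates are known. The NumPy sketch below is a naive dense $O(T^2)$ illustration of that unrolling, not the paper's own formulation.

```python
def qrn_parallel(Z, H_cand):
    """Compute all h_t at once from scalar gates Z (T,) and candidates H_cand (T, d).

    Uses the unrolled form h_t = sum_{i<=t} z_i h'_i * prod_{i<j<=t} (1 - z_j).
    Numerically naive: assumes no z_j is exactly 1 (it divides by the cumulative product).
    """
    cumprod = np.cumprod(1.0 - Z)                            # cumprod[t] = prod_{j<=t} (1 - z_j)
    weights = np.tril(cumprod[:, None] / cumprod[None, :])   # weights[t, i] = prod_{i<j<=t} (1 - z_j)
    return weights @ (Z[:, None] * H_cand)                   # row t is h_t
```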
#### Module Details
##### Input Modules
* A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
* Position Encoder is used to obtain the sentence representation from the d-dimensional vectors.
* Question vectors are obtained in the same manner (a sketch of the position encoder follows).
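A minimal sketch of position encoding, assuming QRN reuses the scheme from End-To-End Memory Networks (the weight formula below is that scheme, treated here as an assumption):

```python
def position_encoding(word_vectors):
    """Combine a sentence's word embeddings (J, d) into a single sentence vector (d,).

    Weights l[j, k] = (1 - j/J) - (k/d) * (1 - 2*j/J), with 1-based positions j and dimensions k.
    """
    J, d = word_vectors.shape
    j = np.arange(1, J + 1)[:, None]               # (J, 1) word positions
    k = np.arange(1, d + 1)[None, :]               # (1, d) embedding dimensions
    l = (1 - j / J) - (k / d) * (1 - 2 * j / J)    # (J, d) position weights
    return (l * word_vectors).sum(axis=0)          # position-weighted sum over words
```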
##### Output Module
* A V-way single-layer softmax classifier maps the predicted answer vector to a V-dimensional probability vector $v$ over the vocabulary.
* The natural language answer $\hat{y}$ is the arg max word in $v$ (sketched below).
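A minimal sketch of this classifier; the names `W_out` and `vocab` are illustrative, not from the paper:

```python
def output_module(answer_vec, W_out, vocab):
    """Map the predicted answer vector (d,) to a word via a V-way softmax.

    W_out : (V, d) classifier weights; vocab : list of the V candidate answer words.
    """
    logits = W_out @ answer_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the vocabulary
    return vocab[int(np.argmax(probs))], probs   # arg max word and the full distribution
```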
#### Results
* [bAbI QA](https://gist.github.com/shagunsodhani/12691b76addf149a224c24ab64b5bdcc) dataset used.
* QRN's '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and '2rvb' model (2 layers + reset gate + vector gates + bidirectional) on the 10K dataset outperform the MemN2N 1K and 10K models respectively.
* Though DMN+ outperforms QRN by a small margin, QRN is simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
* With very few layers, the model lacks reasoning ability while with too many layers, the model becomes difficult to train.
* Vector gates help on the large (10K) dataset but hurt on the small (1K) dataset.
* Unidirectional models perform poorly.
* The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.