Query-Regression Networks for Machine Comprehension
Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi
arXiv e-Print archive, 2016
Keywords: cs.CL, cs.NE
First published: 2016/06/14
Abstract: We present Query-Regression Network (QRN), a variant of Recurrent Neural
Network (RNN) that is suitable for end-to-end machine comprehension. While
previous work largely relied on external memory and global softmax attention
mechanism, QRN is a single recurrent unit with internal memory and local
sigmoid attention. Unlike most RNN-based models, QRN is able to effectively
handle long-term dependencies and is highly parallelizable. In our experiments
we show that QRN obtains the state-of-the-art result in end-to-end bAbI QA
tasks.
#### Introduction
* **Machine Comprehension (MC)** - given natural language sentences (a story or passage) and a natural language question, produce the answer.
* **End-To-End MC** - cannot use external language resources like dependency parsers; the only supervision during training is the correct answer.
* **Query Regression Network (QRN)** - Variant of Recurrent Neural Network (RNN).
* [Link to the paper](http://arxiv.org/abs/1606.04582)
#### Related Work
* Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices for modeling sequential data but perform poorly on end-to-end MC because of long-term dependencies.
* Attention Models with shared external memory focus on single sentences in each layer but the models tend to be insensitive to the time step of the sentence being accessed.
* **Memory Networks (and MemN2N)**
* Add time-dependent variable to the sentence representation.
* Summarize the memory in each layer to control attention in the next layer.
* **Dynamic Memory Networks (and DMN+)**
* Combine RNN and attention mechanism to incorporate time dependency.
* Uses two GRUs:
* time-axis GRU - summarizes the memory in each layer.
* layer-axis GRU - controls the attention in each layer.
* QRN is a much simpler model without any memory summarized node.
#### QRN
* Single recurrent unit that updates its internal state through time and layers.
* Inputs
* $q_{t}$ - local query vector
* $x_{t}$ - sentence vector
* Outputs
* $h_{t}$ - regressed query vector
* $x_{t}$ - sentence vector without any modifications
* Equations
* $z_{t} = \alpha(x_{t}, q_{t})$
* $\alpha$ is the **update gate function**, measuring the relevance between the input sentence and the local query.
* $h'_{t} = \gamma(x_{t}, q_{t})$
* $\gamma$ is the **regression function**, transforming the local query into the regressed query.
* $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$
* To create a multi-layer model, the output of the current layer becomes the input to the next layer (a minimal sketch follows).
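To make the recurrence concrete, here is a minimal NumPy sketch of a single unidirectional, scalar-gate QRN layer. Only the recurrence $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$ is taken from the paper; the concrete parameterizations of $\alpha$ (a scalar sigmoid over the elementwise product of sentence and query) and $\gamma$ (a tanh over their concatenation) are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrn_layer(X, Q, Wz, bz, Wh, bh):
    """One unidirectional, scalar-gate QRN layer.

    X  : (T, d) sentence vectors x_t
    Q  : (T, d) local query vectors q_t (for the first layer, every row is the question)
    Wz : (d,), bz : scalar   -- assumed form of the update gate alpha
    Wh : (d, 2d), bh : (d,)  -- assumed form of the regression function gamma
    Returns H : (T, d) with H[t] = h_t; the sentences X pass through unchanged.
    """
    T, d = X.shape
    h = np.zeros(d)                                             # h_0
    H = np.zeros((T, d))
    for t in range(T):
        x_t, q_t = X[t], Q[t]
        z_t = sigmoid(Wz @ (x_t * q_t) + bz)                    # update gate z_t (scalar)
        h_cand = np.tanh(Wh @ np.concatenate([x_t, q_t]) + bh)  # regressed query h'_t
        h = z_t * h_cand + (1.0 - z_t) * h                      # h_t = z_t h'_t + (1 - z_t) h_{t-1}
        H[t] = h
    return H
```

Stacking layers then amounts to feeding the regressed queries of layer $k$ back in as the local queries of layer $k+1$, e.g. `H2 = qrn_layer(X, qrn_layer(X, Q, Wz, bz, Wh, bh), Wz2, bz2, Wh2, bh2)`.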
#### Variants
* **Reset gate function** ($r_{t}$) to reset or nullify the regressed query $h'_{t}$ (inspired by GRU).
* The new equation becomes $h_{t} = z_{t} \cdot r_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$
* **Vector gates** - update and reset gate functions can produce vectors instead of scalar values (for finer control).
* **Bidirectional** - QRN can look at both past and future sentences while regressing the queries.
* $q_{t}^{k+1} = h_{t}^{k, \text{forward}} + h_{t}^{k, \text{backward}}$.
* The parameters of the update and regression functions are shared between the two directions (see the sketch after this list).
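Continuing the sketch above (same illustrative parameterizations, reusing `sigmoid` and NumPy), the reset-gate and bidirectional variants look roughly like this; for the vector-gate variant, `Wz`/`Wr` would produce $d$-dimensional gates instead of scalars.

```python
def qrn_layer_reset(X, Q, Wz, bz, Wr, br, Wh, bh):
    """QRN layer with scalar update and reset gates; r_t can nullify the candidate h'_t."""
    T, d = X.shape
    h = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):
        x_t, q_t = X[t], Q[t]
        z_t = sigmoid(Wz @ (x_t * q_t) + bz)                     # update gate z_t
        r_t = sigmoid(Wr @ (x_t * q_t) + br)                     # reset gate r_t (GRU-inspired, assumed form)
        h_cand = np.tanh(Wh @ np.concatenate([x_t, q_t]) + bh)   # regressed query h'_t
        h = z_t * r_t * h_cand + (1.0 - z_t) * h                 # h_t = z_t r_t h'_t + (1 - z_t) h_{t-1}
        H[t] = h
    return H

def next_layer_queries(X, Q, params):
    """Bidirectional combination: q_t^{k+1} = h_t^{forward} + h_t^{backward}.
    The same parameter tuple is used in both directions (weights are shared)."""
    H_fwd = qrn_layer_reset(X, Q, *params)
    H_bwd = qrn_layer_reset(X[::-1], Q[::-1], *params)[::-1]
    return H_fwd + H_bwd
```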
#### Parallelization
* Unlike most RNN based models, recurrent updates in QRN can be computed in parallel across time.
* For the exact equations, refer to the [paper](http://arxiv.org/abs/1606.04582); a rough illustration of the idea follows.
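The key observation is that the recurrence $h_{t} = z_{t} \cdot h'_{t} + (1 - z_{t}) \cdot h_{t-1}$ is linear in $h$, so it unrolls to $h_{t} = \sum_{i \le t} z_{i} h'_{i} \prod_{i < j \le t} (1 - z_{j})$, which has no sequential dependence once the gates and candidates are known. The NumPy sketch below is a naive dense $O(T^2)$ illustration of that unrolling, not the paper's own formulation.

```python
def qrn_parallel(Z, H_cand):
    """Compute all h_t at once from scalar gates Z (T,) and candidates H_cand (T, d).

    Uses the unrolled form h_t = sum_{i<=t} z_i h'_i * prod_{i<j<=t} (1 - z_j).
    Numerically naive: assumes no z_j is exactly 1 (it divides by the cumulative product).
    """
    cumprod = np.cumprod(1.0 - Z)                            # cumprod[t] = prod_{j<=t} (1 - z_j)
    weights = np.tril(cumprod[:, None] / cumprod[None, :])   # weights[t, i] = prod_{i<j<=t} (1 - z_j)
    return weights @ (Z[:, None] * H_cand)                   # row t is h_t
```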
#### Module Details
##### Input Modules
* A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
* Position Encoder is used to obtain the sentence representation from the d-dimensional vectors.
* Question vectors are obtained in the same manner (a sketch of the position encoder follows).
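A minimal sketch of position encoding, assuming QRN reuses the scheme from End-To-End Memory Networks (the weight formula below is that scheme, treated here as an assumption):

```python
def position_encoding(word_vectors):
    """Combine a sentence's word embeddings (J, d) into a single sentence vector (d,).

    Weights l[j, k] = (1 - j/J) - (k/d) * (1 - 2*j/J), with 1-based positions j and dimensions k.
    """
    J, d = word_vectors.shape
    j = np.arange(1, J + 1)[:, None]               # (J, 1) word positions
    k = np.arange(1, d + 1)[None, :]               # (1, d) embedding dimensions
    l = (1 - j / J) - (k / d) * (1 - 2 * j / J)    # (J, d) position weights
    return (l * word_vectors).sum(axis=0)          # position-weighted sum over words
```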
##### Output Module
* A V-way single-layer softmax classifier maps the predicted answer vector to a V-dimensional probability vector $v$ over the vocabulary.
* The natural language answer $\hat{y}$ is the arg max word in $v$ (sketched below).
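A minimal sketch of this classifier; the names `W_out` and `vocab` are illustrative, not from the paper:

```python
def output_module(answer_vec, W_out, vocab):
    """Map the predicted answer vector (d,) to a word via a V-way softmax.

    W_out : (V, d) classifier weights; vocab : list of the V candidate answer words.
    """
    logits = W_out @ answer_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the vocabulary
    return vocab[int(np.argmax(probs))], probs   # arg max word and the full distribution
```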
#### Results
* [bAbI QA](https://gist.github.com/shagunsodhani/12691b76addf149a224c24ab64b5bdcc) dataset used.
* QRN's '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and '2rvb' model (2 layers + reset gate + vector gates + bidirectional) on the 10K dataset outperform the MemN2N 1K and 10K models respectively.
* Though DMN+ outperforms QRN by a small margin, QRN is simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
* With very few layers, the model lacks reasoning ability while with too many layers, the model becomes difficult to train.
* Vector gates help on the large (10K) dataset but hurt on the small (1K) dataset.
* Unidirectional models perform poorly.
* The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.