#### Introduction

* **Machine Comprehension (MC)** - given natural language sentences, answer a natural language question.
* **End-To-End MC** - cannot use language resources like dependency parsers. The only supervision during training is the correct answer.
* **Query Regression Network (QRN)** - a variant of the Recurrent Neural Network (RNN).
* [Link to the paper](http://arxiv.org/abs/1606.04582)

#### Related Work

* Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices to model sequential data but perform poorly on end-to-end MC due to long-term dependencies.
* Attention models with shared external memory focus on single sentences in each layer, but these models tend to be insensitive to the time step of the sentence being accessed.
* **Memory Networks (and MemN2N)**
    * Add a time-dependent variable to the sentence representation.
    * Summarize the memory in each layer to control the attention in the next layer.
* **Dynamic Memory Networks (and DMN+)**
    * Combine RNNs and attention mechanisms to incorporate time dependency.
    * Use 2 GRUs:
        * time-axis GRU - summarizes the memory in each layer.
        * layer-axis GRU - controls the attention in each layer.
* QRN is a much simpler model without any memory-summarizing node.

#### QRN

* Single recurrent unit that updates its internal state through time and layers.
* Inputs
    * $q_{t}$ - local query vector
    * $x_{t}$ - sentence vector
* Outputs
    * $h_{t}$ - reduced query vector
    * $x_{t}$ - sentence vector, passed on without any modification
* Equations
    * $z_{t} = \alpha(x_{t}, q_{t})$
        * $\alpha$ is the **update gate function** that measures the relevance between the input sentence and the local query.
    * $h'_{t} = \gamma(x_{t}, q_{t})$
        * $\gamma$ is the **regression function** that transforms the local query into a regressed query.
    * $h_{t} = z_{t} h'_{t} + (1 - z_{t}) h_{t-1}$
* To create a multi-layer model, the output of the current layer becomes the input to the next layer.
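The recurrence above can be sketched as a minimal NumPy cell. Note this is only an illustration of the update rule $h_{t} = z_{t} h'_{t} + (1 - z_{t}) h_{t-1}$: the exact parameterizations of $\alpha$ and $\gamma$ here (sigmoid and tanh layers over the concatenated $[x_t; q_t]$) are assumptions, not the paper's definitions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QRNCell:
    """Sketch of a single QRN unit with a scalar update gate.

    The parameterizations of alpha (update gate) and gamma (regression)
    are assumed here; see the paper for the actual forms.
    """

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.w_z = rng.normal(scale=0.1, size=2 * d)       # update gate alpha
        self.W_h = rng.normal(scale=0.1, size=(d, 2 * d))  # regression gamma

    def step(self, x_t, q_t, h_prev):
        xq = np.concatenate([x_t, q_t])
        z_t = sigmoid(self.w_z @ xq)     # relevance of sentence to local query
        h_cand = np.tanh(self.W_h @ xq)  # regressed query h'_t
        # h_t = z_t * h'_t + (1 - z_t) * h_{t-1}
        return z_t * h_cand + (1.0 - z_t) * h_prev

    def run_layer(self, X, Q):
        """X: (T, d) sentence vectors; Q: (T, d) local queries
        (for the first layer, the question vector repeated at every step).
        Returns h_1..h_T, which serve as the next layer's local queries."""
        h = np.zeros(X.shape[1])
        out = []
        for x_t, q_t in zip(X, Q):
            h = self.step(x_t, q_t, h)
            out.append(h)
        return np.array(out)
```

Stacking layers is then just feeding one layer's reduced queries in as the next layer's local queries, with the same sentence vectors `X`.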
#### Variants

* **Reset gate function** ($r_{t}$) to reset or nullify the regressed query $h'_{t}$ (inspired by the GRU).
    * The new equation becomes $h_{t} = z_{t} r_{t} h'_{t} + (1 - z_{t}) h_{t-1}$
* **Vector gates** - the update and reset gate functions can produce vectors instead of scalar values (for finer control).
* **Bidirectional** - QRN can look at both past and future sentences while regressing the queries.
    * $q_{t}^{k+1} = h_{t}^{k, \text{forward}} + h_{t}^{k, \text{backward}}$
    * The parameters of the update and regression functions are shared between the two directions.

#### Parallelization

* Unlike most RNN-based models, the recurrent updates in QRN can be computed in parallel across time.
* For details and equations, refer to the [paper](http://arxiv.org/abs/1606.04582).

#### Module Details

##### Input Module

* A trainable embedding matrix $A$ is used to encode the one-hot vector of each word in the input sentence into a $d$-dimensional vector.
* A Position Encoder is used to obtain the sentence representation from the $d$-dimensional word vectors.
* Question vectors are obtained in a similar manner.

##### Output Module

* A $V$-way single-layer softmax classifier is used to map the predicted answer vector $y$ to a $V$-dimensional sparse vector $v$.
* The natural language answer is the argmax word in $v$.

#### Results

* Experiments use the [bAbI QA](https://gist.github.com/shagunsodhani/12691b76addf149a224c24ab64b5bdcc) dataset.
* QRN with the '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and with the '2rvb' model (2 layers + reset gate + vector gate + bidirectional) on the 10K dataset outperforms the MemN2N 1K and 10K models respectively.
* Though DMN+ outperforms QRN by a small margin, QRNs are simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
* With very few layers the model lacks reasoning ability, while with too many layers it becomes difficult to train.
* Vector gates help on the large (10K) dataset but hurt on the small (1K) dataset.
* Unidirectional models perform poorly.
* The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.
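The parallelization mentioned earlier follows from unrolling the recurrence: with $h_0 = 0$, the update $h_{t} = z_{t} h'_{t} + (1 - z_{t}) h_{t-1}$ expands to $h_{t} = \sum_{i \le t} z_{i} h'_{i} \prod_{j=i+1}^{t} (1 - z_{j})$, so once all $z_t$ and $h'_t$ are computed, every $h_t$ can be obtained without a sequential dependency. A small NumPy check of this identity, using random stand-in gate values rather than the paper's trained functions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 3
z = rng.uniform(0.05, 0.95, size=T)  # scalar update gates z_1..z_T
h_cand = rng.normal(size=(T, d))     # regressed queries h'_1..h'_T

# Sequential recurrence: h_t = z_t * h'_t + (1 - z_t) * h_{t-1}, h_0 = 0.
h = np.zeros(d)
seq_states = []
for t in range(T):
    h = z[t] * h_cand[t] + (1.0 - z[t]) * h
    seq_states.append(h.copy())
seq_states = np.array(seq_states)

# Parallel form: h_t = sum_{i<=t} z_i * h'_i * prod_{j=i+1..t} (1 - z_j).
# Products of (1 - z_j) are built from prefix sums of logs, so the whole
# coefficient matrix is computed at once with no loop over time.
cum = np.concatenate([[0.0], np.cumsum(np.log1p(-z))])
coef = np.exp(cum[1:][:, None] - cum[1:][None, :])  # coef[t, i], i <= t
coef = np.tril(coef)                                # zero out future terms
par_states = coef @ (z[:, None] * h_cand)

assert np.allclose(seq_states, par_states)
```

In practice this means all time steps of a QRN layer can be evaluated as a couple of matrix operations, which is where the claimed speed advantage over sequential RNNs comes from.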