Teaching Machines to Read and Comprehend
Hermann, Karl Moritz; Kociský, Tomás; Grefenstette, Edward; Espeholt, Lasse; Kay, Will; Suleyman, Mustafa; Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp
#### Introduction
* Build a supervised reading comprehension dataset from a news corpus.
* Compare the performance of neural models against state-of-the-art natural language processing models on the reading comprehension task.
* [Link to the paper](http://arxiv.org/abs/1506.03340v3)
#### Reading Comprehension
* Estimate conditional probability $p(a|c, q)$, where $c$ is a context document, $q$ is a query related to the document, and $a$ is the answer to that query.
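The neural models described below realise this by scoring each candidate entity against a joint context-query embedding $g(c, q)$ and normalising with a softmax, roughly:

$$p(a \mid c, q) \propto \exp\bigl(W(a)\, g(c, q)\bigr), \qquad a \in V,$$

where $V$ is the set of candidate entity markers and $W(a)$ is the output weight row associated with answer $a$.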
#### Dataset Generation
* Use online newspapers (CNN and Daily Mail) and their matching summaries.
* Parse the summaries and bullet points into Cloze-style questions.
* Generate corpus of document-query-answer triplets by replacing one entity at a time with a placeholder.
* Data is anonymised and randomised using coreference systems, abstract entity markers, and random permutation of the entity markers.
* The processed dataset is more focused on evaluating reading comprehension, as models cannot exploit world knowledge or entity co-occurrence instead of actually reading the text.
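A minimal sketch of the anonymisation and Cloze-triple generation step, assuming the mention strings for each coreferent entity are already available from a coreference system (the helper below and its input format are hypothetical, not the paper's exact pipeline):

```python
import random

def make_cloze_triples(document, summary_point, entity_mentions):
    """Build (document, query, answer) triples from one bullet-point summary.

    `entity_mentions` maps a coreference-resolved entity id to the surface
    strings that refer to it (hypothetical input format; in the paper this
    comes from a coreference system run over the article).
    """
    entity_ids = list(entity_mentions)
    random.shuffle(entity_ids)  # random permutation of markers per example
    marker = {eid: f"@entity{i}" for i, eid in enumerate(entity_ids)}

    def anonymise(text):
        for eid, mentions in entity_mentions.items():
            for m in sorted(mentions, key=len, reverse=True):  # longest mention first
                text = text.replace(m, marker[eid])
        return text

    doc = anonymise(document)
    point = anonymise(summary_point)

    triples = []
    for eid in entity_ids:
        if marker[eid] in point:
            query = point.replace(marker[eid], "@placeholder")
            triples.append((doc, query, marker[eid]))  # answer is the removed entity
    return triples
```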
#### Models
##### Baseline Models
* **Majority Baseline**
* Picks the most frequently observed entity in the context document.
* **Exclusive Majority**
* Picks the most frequently observed entity in the context document which is not observed in the query.
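Both baselines reduce to frequency counts over the anonymised entity markers; a small sketch, assuming the document and query are already reduced to lists of entity markers:

```python
from collections import Counter

def majority_baseline(doc_entities):
    """Answer with the entity marker that occurs most often in the document."""
    return Counter(doc_entities).most_common(1)[0][0]

def exclusive_majority_baseline(doc_entities, query_entities):
    """Answer with the most frequent document entity that does not appear in the query."""
    excluded = set(query_entities)
    counts = Counter(e for e in doc_entities if e not in excluded)
    return counts.most_common(1)[0][0] if counts else majority_baseline(doc_entities)
```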
##### Symbolic Matching Models
* **Frame-Semantic Parsing**
* Parse the sentence to find predicates to answer questions like "who did what to whom".
* Extract entity-predicate triples $(e_1, V, e_2)$ from the query $q$ and the context document $d$.
* Resolve queries using rules such as `exact match`, `matching entity`, etc.
* **Word Distance Benchmark**
* Align placeholder of Cloze form questions with each possible entity in the context document and calculate the distance between the question and the context around the aligned entity.
* Sum the distance of every word in $q$ to its nearest aligned word in $d$.
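A simplified sketch of the word distance benchmark (the per-word distance cap `max_penalty` and the exact alignment rule are assumptions; the paper's version also counts alignments given by the coreference system):

```python
def word_distance_answer(doc_tokens, query_tokens, candidates, max_penalty=8):
    """Pick the candidate entity whose document occurrence minimises the summed
    distance from each query word to its nearest occurrence in the document."""
    positions = {}
    for i, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(i)

    def alignment_cost(anchor):
        cost = 0
        for q in query_tokens:
            if q == "@placeholder":
                continue
            occurrences = positions.get(q, [])
            d = min((abs(anchor - i) for i in occurrences), default=max_penalty)
            cost += min(d, max_penalty)  # cap each word's contribution (assumed detail)
        return cost

    best, best_cost = None, float("inf")
    for entity in candidates:
        for anchor in positions.get(entity, []):  # each place the entity appears
            c = alignment_cost(anchor)
            if c < best_cost:
                best, best_cost = entity, c
    return best
```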
##### Neural Network Models
* **Deep LSTM Reader**
* Test the ability of Deep LSTM encoders to handle significantly longer sequences.
* Feed the document-query pair as a single long sequence, one token at a time.
* Use a deep LSTM cell with skip connections from the input to every hidden layer and from every hidden layer to the output.
* **Attentive Reader**
* Employ an attention mechanism to overcome the bottleneck of a fixed-width hidden vector.
* Encode the document and the query using separate single-layer bidirectional LSTMs.
* Query encoding is obtained by concatenating the final forward and backwards outputs.
* Document encoding is obtained by a weighted sum of output vectors (obtained by concatenating the forward and backwards outputs).
* The weights can be interpreted as the degree to which the network attends to a particular token in the document.
* The model is completed by defining a non-linear combination of the document and query embeddings (a minimal sketch of this attention step is shown after this list).
* **Impatient Reader**
* Extends the Attentive Reader: the model can re-read from the document as each query token is read.
* The model accumulates information from the document as each query token is seen and finally outputs a joint document-query representation as a non-linear combination of the document embedding and the query embedding.
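A minimal NumPy sketch of the Attentive Reader's attention step, following the paper's notation for the token match $m_t$, the attention weights, and the joint embedding $g$; the parameter shapes and initialisation are assumed, and the bidirectional LSTM encodings are taken as given:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_reader_combine(doc_states, query_state, W_ym, W_um, w_ms, W_rg, W_ug):
    """One attention pass of the Attentive Reader (hedged sketch).

    doc_states:  (T, 2h) bidirectional LSTM outputs, one row per document token
    query_state: (2h,)   concatenation of the final forward/backward query outputs
    W_*, w_ms:   learned parameters (placeholder names and shapes)
    """
    # m_t = tanh(W_ym y_t + W_um u): match each document token against the query
    m = np.tanh(doc_states @ W_ym.T + W_um @ query_state)   # (T, k)
    alpha = softmax(m @ w_ms)                                # attention weight per token
    r = alpha @ doc_states                                   # attention-weighted document encoding
    g = np.tanh(W_rg @ r + W_ug @ query_state)               # joint embedding g(c, q)
    return g, alpha
```

The joint embedding $g$ feeds the answer softmax over candidate entities, and the weights `alpha` are what the paper visualises as heat maps over the document; the Impatient Reader repeats a step of this form for every query token, carrying a recurrent state between steps.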
#### Results
* The Attentive and Impatient Readers outperform all other models, highlighting the benefits of attention modelling.
* The Frame-Semantic pipeline does not scale to cases where several methods are needed to answer a query.
* Moreover, it provides poor coverage, as many relations do not adhere to the default predicate-argument structure.
* The Word Distance approach outperformed the Frame-Semantic approach, as there is significant lexical overlap between the query and the document.
* The paper also includes heat maps over the context documents to visualise the attention mechanism.