WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, David Berthelot
arXiv e-Print archive - 2016
Keywords:
cs.CL
First published: 2016/08/11
Abstract: We present WikiReading, a large-scale natural language understanding task and
publicly-available dataset with 18 million instances. The task is to predict
textual values from the structured knowledge base Wikidata by reading the text
of the corresponding Wikipedia articles. The task contains a rich variety of
challenging classification and extraction sub-tasks, making it well-suited for
end-to-end models such as deep neural networks (DNNs). We compare various
state-of-the-art DNN-based architectures for document classification,
information extraction, and question answering. We find that models supporting
a rich answer space, such as word or character sequences, perform best. Our
best-performing model, a word-level sequence to sequence model with a mechanism
to copy out-of-vocabulary words, obtains an accuracy of 71.8%.
#### Introduction
* Large-scale natural language understanding task: predict textual values from a structured knowledge base (Wikidata) by reading the text of the corresponding Wikipedia article.
* Accompanied by a publicly available dataset of 18 million instances built from Wikipedia and Wikidata.
* [Link to the paper](http://www.aclweb.org/anthology/P/P16/P16-1145.pdf)
#### Dataset
* WikiReading dataset built using Wikidata and Wikipedia.
* Wikidata consists of statements of the form (property, value) about different items
* 80M statements, 16M items and 884 properties.
* These statements are grouped by (item, property) to get (item, property, answer) tuples, where the answer is the set of all values for that property.
* Each item is then replaced by the text of its Wikipedia article, yielding 18.58M instances of the form (document, property, answer).
* Task is to predict answer given document and property.
* Properties are divided into 2 classes:
* **Categorical properties** - properties with a small number of possible answers, e.g., gender.
* **Relational properties** - properties whose answers are essentially unique to each item, e.g., date of birth.
* The split is made on the basis of the entropy of each property's answer distribution (see the sketch after this list).
* Properties with answer-distribution entropy below 0.7 are classified as categorical; the rest are relational.
* The answer distribution has a small number of very frequent answers (the head) and a long tail of answers that occur only rarely.
* 30% of the answers do not appear in the training set and must be inferred from the document.
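A minimal sketch of how the entropy-based split could be computed, assuming instances are available as (document, property, answer) tuples; the helper name is illustrative, and the exact entropy normalization used in the paper may differ:

```python
import math
from collections import Counter, defaultdict

def classify_properties(instances, threshold=0.7):
    """Label each property as categorical or relational by answer entropy."""
    answer_counts = defaultdict(Counter)
    for _doc, prop, answer in instances:
        answer_counts[prop][answer] += 1

    labels = {}
    for prop, counts in answer_counts.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        entropy = -sum(p * math.log(p, 2) for p in probs)
        # Few, frequent answers -> low entropy -> categorical (e.g. gender);
        # many near-unique answers -> high entropy -> relational (e.g. date of birth).
        labels[prop] = "categorical" if entropy < threshold else "relational"
    return labels
```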
#### Models
##### Answer Classification
* Consider WikiReading as a classification task and treat each answer as a class label.
###### Baseline
* Linear model over Bag of Words (BoW) features.
* Two BoW vectors are computed - one for the document and one for the property. These are concatenated into a single feature vector.
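A rough sketch of this featurization, assuming a fixed word-to-index vocabulary `vocab`; the classifier itself is a standard linear/softmax model over the concatenated vector (training details omitted):

```python
import numpy as np

def bow_vector(tokens, vocab):
    """Count-based bag-of-words vector over a fixed vocabulary."""
    v = np.zeros(len(vocab), dtype=np.float32)
    for tok in tokens:
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def featurize(document_tokens, property_tokens, vocab):
    # One BoW vector for the document, one for the property,
    # concatenated into a single feature vector for the linear classifier.
    return np.concatenate([bow_vector(document_tokens, vocab),
                           bow_vector(property_tokens, vocab)])
```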
###### Neural Networks Method
* Encode the property and the document into a joint representation, which is fed into a softmax layer over answers.
* **Average Embeddings BoW**
* Average the word embeddings of the document and of the property separately, then concatenate the two averages to form the joint representation.
* **Paragraph Vectors**
* As a variant of the previous method, encode document as a paragraph vector.
* **LSTM Reader**
* An LSTM reads the property followed by the document, word by word, and its final state is used as the joint representation.
* **Attentive Reader**
* Use an attention mechanism to focus on the parts of the document relevant to the given property (see the attention sketch after this list).
* **Memory Networks**
* Maps a property p and a list of sentences x<sub>1</sub>, x<sub>2</sub>, ... x<sub>n</sub> into a joint representation via attention over the sentences of the document.
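As a rough illustration of the attention step in the Attentive Reader (and, at sentence granularity, Memory Networks), the sketch below assumes the property has already been encoded as a vector `prop_vec` and the document as per-word vectors `doc_states`; dot-product scoring is used for brevity and may differ from the papers' exact formulations:

```python
import numpy as np

def attend(prop_vec, doc_states):
    """prop_vec: (d,), doc_states: (num_words, d) -> joint representation (2d,)."""
    scores = doc_states @ prop_vec            # one relevance score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over document positions
    context = weights @ doc_states            # attention-weighted document summary
    # The joint representation then feeds the answer softmax layer.
    return np.concatenate([prop_vec, context])
```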
##### Answer Extraction
* For relational properties, it makes more sense to model the problem as information extraction rather than classification.
* **RNNLabeler**
* An RNN reads the sequence of words and estimates whether each word is part of the answer.
* **Basic SeqToSeq (Sequence to Sequence)**
* Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words.
* **Placeholder SeqToSeq**
* Extends Basic SeqToSeq to handle OOV (out-of-vocabulary) words by adding placeholder symbols to the vocabulary.
* OOV words in the document and answer are replaced by placeholders so that input and output sequences contain only in-vocabulary words and placeholders (see the placeholder sketch after this list).
* **Basic Character SeqToSeq**
* A property encoder RNN reads the property character by character and transforms it into a fixed-length vector.
* This vector becomes the initial hidden state of the second layer of a 2-layer document encoder RNN.
* The final state of this RNN is used by an answer decoder RNN to generate the answer as a character sequence.
* **Character SeqToSeq with pretraining**
* Train a character-level language model on the input character sequences from the training set and use its weights to initialize the first layer of the encoder and decoder (see the weight-copy sketch after this list).
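A sketch of the placeholder substitution used by Placeholder SeqToSeq, assuming a fixed in-vocabulary set `vocab` and a budget of placeholder symbols; the function and symbol names are illustrative, not from the paper:

```python
def add_placeholders(doc_tokens, answer_tokens, vocab, n_placeholders=100):
    """Replace OOV words in document and answer with shared placeholder symbols."""
    mapping = {}  # OOV word -> placeholder symbol

    def convert(tok):
        if tok in vocab:
            return tok
        if tok not in mapping:
            if len(mapping) >= n_placeholders:
                return "<unk>"  # placeholder budget exhausted
            mapping[tok] = f"PLACEHOLDER_{len(mapping)}"
        return mapping[tok]

    doc = [convert(t) for t in doc_tokens]
    answer = [convert(t) for t in answer_tokens]
    # Inverse map: a placeholder emitted by the decoder is copied back
    # to the original document word at prediction time.
    return doc, answer, {ph: word for word, ph in mapping.items()}
```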
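And a minimal sketch, in PyTorch, of the pretraining idea for the character model: train a character-level language model first, then copy its recurrent weights into the first layer of the encoder (and, analogously, the decoder). Layer sizes and the exact copying scheme are assumptions, not taken from the paper:

```python
import torch.nn as nn

CHAR_VOCAB, HIDDEN = 256, 512   # assumed sizes, for illustration only

# Character-level language model (next-character prediction), trained first.
char_lm = nn.LSTM(input_size=CHAR_VOCAB, hidden_size=HIDDEN, num_layers=1)
# ... train char_lm on character sequences from the training documents ...

# 2-layer document encoder; only its first layer is initialized from the LM.
encoder = nn.LSTM(input_size=CHAR_VOCAB, hidden_size=HIDDEN, num_layers=2)
encoder.weight_ih_l0.data.copy_(char_lm.weight_ih_l0.data)
encoder.weight_hh_l0.data.copy_(char_lm.weight_hh_l0.data)
encoder.bias_ih_l0.data.copy_(char_lm.bias_ih_l0.data)
encoder.bias_hh_l0.data.copy_(char_lm.bias_hh_l0.data)
```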
#### Experiments
* The evaluation metric is F1 score, the harmonic mean of precision and recall (see the sketch after this list).
* All models perform well on categorical properties, with neural models outperforming the linear baseline.
* In the case of relational properties, SeqToSeq models have a clear edge.
* SeqToSeq models are also the most balanced, performing well on both categorical and relational properties.
* Language model pretraining enhances the performance of character SeqToSeq approach.
* Results suggest that end-to-end SeqToSeq models are the most promising approach for WikiReading-like tasks.
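A sketch of the per-instance F1 computation over answer sets, assuming exact string matching between predicted and gold answers (the paper's precise matching and normalization rules are not shown):

```python
def instance_f1(predicted, gold):
    """F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# instance_f1(["Paris"], ["Paris"]) == 1.0
# instance_f1(["Paris", "Lyon"], ["Paris"]) ~= 0.67
```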