Summary by Shagun Sodhani
#### Introduction
* Build a supervised reading comprehension dataset from a news corpus.
* Compare the performance of neural models against state-of-the-art traditional NLP pipelines on the reading comprehension task.
* [Link to the paper](http://arxiv.org/abs/1506.03340v3)
#### Reading Comprehension
* Estimate conditional probability $p(a|c, q)$, where $c$ is a context document, $q$ is a query related to the document, and $a$ is the answer to that query.
#### Dataset Generation
* Use online newspapers (CNN and Daily Mail) together with their matching bullet-point summaries.
* Parse the bullet-point summaries into Cloze-style questions.
* Generate a corpus of document-query-answer triples by replacing one entity at a time with a placeholder.
* Anonymise and randomise the data using a coreference system, abstract entity markers, and random permutation of those markers (a minimal sketch of this step follows the list).
* The processed dataset evaluates reading comprehension more directly, as models cannot exploit world knowledge or co-occurrence statistics about the entities.
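A minimal sketch of the anonymisation step, assuming entity mentions have already been grouped into coreference clusters upstream; the function and argument names here are hypothetical, not the paper's code:

```python
import random

def anonymise(document, query, answer, entity_clusters):
    """Replace every entity with an abstract marker (@entity0, @entity1, ...)
    whose assignment is randomly permuted per example, so models cannot
    exploit world knowledge or co-occurrence statistics about named entities."""
    markers = [f"@entity{i}" for i in range(len(entity_clusters))]
    random.shuffle(markers)  # fresh random permutation for this example

    # Map every surface mention of an entity to its cluster's marker.
    mapping = {}
    for marker, cluster in zip(markers, entity_clusters):
        for mention in cluster:
            mapping[mention] = marker

    def substitute(text):
        # Replace longer mentions first so "Barack Obama" wins over "Obama".
        for mention in sorted(mapping, key=len, reverse=True):
            text = text.replace(mention, mapping[mention])
        return text

    return substitute(document), substitute(query), mapping.get(answer, answer)
```

Because the marker permutation changes per example, the same real-world entity maps to different markers across documents, so only the given context document can identify the answer.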
#### Models
##### Baseline Models
* **Majority Baseline**
* Picks the most frequently observed entity in the context document.
* **Exclusive Majority**
* Picks the most frequently observed entity in the context document which is not observed in the query.
##### Symbolic Matching Models
* **Frame-Semantic Parsing**
* Parse sentences to identify predicates and their arguments, i.e. answer questions like "who did what to whom".
* Extract entity-predicate triples $(e_1, V, e_2)$ from both the query $q$ and the context document $d$.
* Resolve queries using a sequence of rules such as `exact match` (the query triple, with an entity aligned to the placeholder, appears verbatim among the document's triples), `matching entity`, etc.
* **Word Distance Benchmark**
* Align the placeholder of the Cloze-form question with each possible entity in the context document and measure how well the question aligns with the context surrounding that entity.
* Score each candidate by summing the distance from every word in $q$ to its nearest aligned word in $d$ (a minimal sketch follows this list).
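A minimal sketch of this heuristic, assuming tokenised inputs and word-identity alignment; the distance cap and function names are assumptions, not the paper's implementation:

```python
def word_distance(doc_tokens, query_tokens, entity_pos, max_penalty=8):
    """Score the alignment of the Cloze placeholder with the entity at
    entity_pos: sum, over every query word, the distance to its nearest
    occurrence in the document, capped at max_penalty (the cap value
    here is an assumption)."""
    total = 0
    for q_tok in query_tokens:
        positions = [i for i, d_tok in enumerate(doc_tokens) if d_tok == q_tok]
        nearest = min((abs(i - entity_pos) for i in positions), default=max_penalty)
        total += min(nearest, max_penalty)
    return total

def answer(doc_tokens, query_tokens, candidate_entities):
    """Pick the candidate entity whose best-scoring occurrence minimises
    the total word distance (lower is better)."""
    occurrences = [i for i, tok in enumerate(doc_tokens) if tok in candidate_entities]
    best = min(occurrences, key=lambda i: word_distance(doc_tokens, query_tokens, i))
    return doc_tokens[best]
```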
##### Neural Network Models
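* All the neural models below share the same output layer (from the paper): the joint document-query embedding $g(d, q)$ is scored against every candidate word, $$p(a|d, q) \propto \exp(W(a)\, g(d, q)), \quad a \in V,$$ where $W(a)$ is the row of a weight matrix $W$ corresponding to answer $a$ and $V$ is the vocabulary; the models differ only in how they compute $g(d, q)$.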
* **Deep LSTM Reader**
* Test the ability of Deep LSTM encoders to handle significantly longer sequences.
* Feed the document-query pair as a single long sequence, one word at a time, separated by a delimiter.
* Use a Deep LSTM cell with skip connections from the input to every hidden layer and from every hidden layer to the output (sketched below).
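A minimal PyTorch sketch of this reader; the layer sizes are illustrative, and the paper's skip connections are omitted for brevity:

```python
import torch
import torch.nn as nn

class DeepLSTMReader(nn.Module):
    """Sketch: the document and query are concatenated (with a delimiter
    token) and fed to a deep LSTM one word at a time; the output after
    the last token serves as the joint embedding g(d, q)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=layers, batch_first=True)
        self.score = nn.Linear(hidden_dim, vocab_size)  # rows act as W(a)

    def forward(self, doc_query_ids):          # (batch, |d| + 1 + |q|)
        x = self.embed(doc_query_ids)
        outputs, _ = self.lstm(x)
        g = outputs[:, -1, :]                   # state after the last token
        return self.score(g)                    # unnormalised log p(a|d, q)
```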
* **Attentive Reader**
* Employ attention model to overcome the bottleneck of fixed width hidden vector.
* Encode the document and the query separately, each with a single-layer bidirectional LSTM.
* The query encoding is the concatenation of the final forward and backward outputs.
* The document encoding is a weighted sum of the output vectors (each the concatenation of the forward and backward outputs at a token).
* The weights can be interpreted as the degree to which the network attends to a particular token in the document.
* The model is completed by defining a non-linear combination of the document and query embeddings (a sketch of the attention step follows).
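A PyTorch sketch of the attention step, following the paper's $m(t)$, $s(t)$, $r$, $g$ formulation; the layer sizes and batch-first layout are assumptions:

```python
import torch
import torch.nn as nn

class AttentiveReader(nn.Module):
    """Sketch: the query embedding u attends over the bidirectional
    document outputs y(t); the document representation r is their
    attention-weighted sum, combined non-linearly with u to give g."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.doc_lstm = nn.LSTM(embed_dim, hidden_dim,
                                bidirectional=True, batch_first=True)
        self.qry_lstm = nn.LSTM(embed_dim, hidden_dim,
                                bidirectional=True, batch_first=True)
        dim = 2 * hidden_dim
        self.Wym = nn.Linear(dim, dim, bias=False)   # mixes doc tokens
        self.Wum = nn.Linear(dim, dim, bias=False)   # mixes the query
        self.wms = nn.Linear(dim, 1, bias=False)     # scores each token
        self.Wrg = nn.Linear(dim, dim, bias=False)
        self.Wug = nn.Linear(dim, dim, bias=False)
        self.score = nn.Linear(dim, vocab_size)

    def forward(self, doc_ids, qry_ids):
        y, _ = self.doc_lstm(self.embed(doc_ids))        # (B, |d|, 2h)
        q_out, _ = self.qry_lstm(self.embed(qry_ids))
        h = q_out.size(2) // 2
        # Query embedding u: final forward output + final backward output.
        u = torch.cat([q_out[:, -1, :h], q_out[:, 0, h:]], dim=-1)
        m = torch.tanh(self.Wym(y) + self.Wum(u).unsqueeze(1))
        s = torch.softmax(self.wms(m).squeeze(-1), dim=1)  # attention weights
        r = (s.unsqueeze(-1) * y).sum(dim=1)               # doc embedding
        g = torch.tanh(self.Wrg(r) + self.Wug(u))          # joint embedding
        return self.score(g)
```

The weighted sum lets the document representation depend on the query, avoiding the fixed-width bottleneck of the Deep LSTM Reader.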
* **Impatient Reader**
* An extension of the Attentive Reader: the model can re-read the document as each query token is read.
* It accumulates information from the document as each query token is seen, and finally outputs a joint document-query representation as a non-linear combination of the document and query embeddings (see the recurrence sketch below).
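A sketch of the re-reading recurrence for a single example; the weight modules are passed in as hypothetical layers, and this follows the paper's recurrent-attention idea rather than its exact parameterisation:

```python
import torch

def impatient_read(y_d, y_q, Wdm, Wrm, Wqm, w_ms, Wrr):
    """Re-attend over the document outputs y_d (|d|, dim) after reading
    each query token embedding in y_q (|q|, dim), accumulating the
    document representation r step by step."""
    r = torch.zeros(y_d.size(-1))
    for i in range(y_q.size(0)):
        # Attention over document tokens, conditioned on the current
        # query token and the evidence accumulated so far.
        m = torch.tanh(Wdm(y_d) + Wrm(r) + Wqm(y_q[i]))  # (|d|, dim)
        s = torch.softmax(m @ w_ms, dim=0)               # (|d|,)
        # Weighted read from the document plus a recurrent carry of r.
        r = y_d.t() @ s + torch.tanh(Wrr(r))
    return r  # combined non-linearly with the query embedding to form g(d, q)
```

Here `Wdm`, `Wrm`, `Wqm`, and `Wrr` would be `nn.Linear(dim, dim, bias=False)` modules and `w_ms` a learned vector of length `dim`.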
#### Results
* The Attentive and Impatient Readers outperform all other models, highlighting the benefits of attention modelling.
* The Frame-Semantic pipeline does not scale to cases where answering a query requires combining several extraction rules.
* It also provides poor coverage, as many relations do not adhere to the default predicate-argument structure.
* The Word Distance approach outperformed the Frame-Semantic approach, since there is significant lexical overlap between queries and documents.
* The paper also includes heat maps over the context documents to visualise the attention mechanism.
