#### Introduction
* Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering.
* [Link to the paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/)
#### Dataset
* 3047 questions sampled from Bing query logs.
* Each question associated with a Wikipedia page.
* All sentences in the summary paragraph of the page become the candidate answers.
* Only about one-third of the questions have a correct answer in the candidate answer set.
* Correct answers labeled through crowdsourcing on an MTurk-like platform.
* Answer sentences are associated with *answer phrases* (shortest substring of a sentence that answers the question) though this annotation is not used in the experiments reported by the paper.
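
The structure described above (one question, several candidate sentences, a binary label per candidate) can be sketched as follows. This is a minimal illustration, not the dataset's actual loader: the `rows` tuples, their values, and the helper names `group_candidates` and `answerable_fraction` are all hypothetical.

```python
from collections import defaultdict

# Toy rows shaped as (question_id, sentence, label); the values are made up.
rows = [
    ("Q1", "sentence a", 0),
    ("Q1", "sentence b", 1),
    ("Q2", "sentence c", 0),
    ("Q3", "sentence d", 0),
    ("Q3", "sentence e", 0),
]


def group_candidates(rows):
    """Group (sentence, label) candidates under their question id."""
    grouped = defaultdict(list)
    for qid, sentence, label in rows:
        grouped[qid].append((sentence, label))
    return dict(grouped)


def answerable_fraction(grouped):
    """Fraction of questions with at least one candidate labeled correct."""
    if not grouped:
        return 0.0
    answerable = sum(
        1 for cands in grouped.values() if any(label == 1 for _, label in cands)
    )
    return answerable / len(grouped)


grouped = group_candidates(rows)
print(answerable_fraction(grouped))  # share of toy questions that have an answer
```

Questions with no correct candidate are the ones that make answer triggering a meaningful task.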
#### Other Datasets
* [QASent dataset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf)
* Uses questions from the TREC-QA dataset (questions from both query logs and human editors) and selects sentences that share at least one non-stopword with the question.
* This lexical overlap makes the QA task easier.
* Does not support evaluating for *answer triggering* (detecting if the correct answer even exists in the candidate sentences).
#### Experiments
##### Baseline Systems
* **Word Count** - Counts the number of non-stopwords common to question and answer sentences.
* **Weighted Word Count** - Re-weights the word counts by the IDF values of the question words.
* **[LCLR](https://www.microsoft.com/en-us/research/publication/question-answering-using-enhanced-lexical-semantic-models/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
* **Paragraph Vectors** - Considers cosine similarity between question vector and sentence vector.
* **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling.
* **PV-Cnt** and **CNN-Cnt** - Logistic regression classifier combining PV (and CNN) models and Word Count models.
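
The two counting baselines are simple enough to sketch directly. The version below is a rough illustration under several assumptions: the stopword list is a tiny hypothetical one, overlap is counted over distinct words, and the smoothed IDF formula is a common choice rather than the one the paper necessarily uses.

```python
import math
import re

# Tiny illustrative stopword list (an assumption; real systems use a fuller one).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to",
             "what", "who", "how", "was", "and"}


def content_words(text):
    """Lowercase, tokenize, and drop stopwords; returns distinct words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def word_count(question, sentence):
    """Number of distinct non-stopwords shared by question and sentence."""
    return len(content_words(question) & content_words(sentence))


def idf_table(sentences):
    """Smoothed IDF per word: log((N + 1) / (df + 1)) + 1 (assumed formula)."""
    n = len(sentences)
    df = {}
    for s in sentences:
        for w in content_words(s):
            df[w] = df.get(w, 0) + 1
    return {w: math.log((n + 1) / (c + 1)) + 1 for w, c in df.items()}


def weighted_word_count(question, sentence, idf):
    """Sum of IDF weights over the shared non-stopwords."""
    shared = content_words(question) & content_words(sentence)
    return sum(idf.get(w, 0.0) for w in shared)


question = "how are glacier caves formed"
sentences = [
    "A glacier cave is a cave formed within the ice of a glacier.",
    "Glacier caves are often called ice caves.",
    "The ice melts in summer.",
]
idf = idf_table(sentences)
for s in sentences:
    print(word_count(question, s), round(weighted_word_count(question, s, idf), 3), s)
```

Candidates are then ranked by either score; rarer shared words contribute more under the weighted variant.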
##### Metrics
* MAP and MRR for answer selection problem.
* Precision, recall and F1 scores for answer triggering problem.
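
Both groups of metrics can be sketched in a few lines. The ranking metrics below follow the standard definitions of MAP and MRR. The answer-triggering function reflects one reading of the task (the system answers only when its top score exceeds a threshold, and is correct only if that top sentence is labeled correct); the function names and the threshold value are assumptions for illustration.

```python
def average_precision(labels):
    """labels: 0/1 relevance of candidates in ranked order (best first)."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first correct candidate, or 0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, ranked_lists):
    """Average a per-question metric over all questions (MAP or MRR)."""
    return sum(metric(labels) for labels in ranked_lists) / len(ranked_lists)


def triggering_prf(questions, threshold):
    """questions: one list of (score, label) pairs per question."""
    answered = correct = answerable = 0
    for cands in questions:
        if any(label for _, label in cands):
            answerable += 1
        best_score, best_label = max(cands, key=lambda c: c[0])
        if best_score > threshold:  # system decides to answer
            answered += 1
            if best_label:
                correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / answerable if answerable else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


ranked = [[0, 1, 0, 1], [1, 0], [0, 0, 1]]
print(mean_metric(average_precision, ranked), mean_metric(reciprocal_rank, ranked))

scored = [[(0.9, 1), (0.2, 0)], [(0.8, 0), (0.1, 0)], [(0.3, 1), (0.2, 0)]]
print(triggering_prf(scored, threshold=0.5))
```

MAP and MRR are only informative for questions that have at least one correct candidate, which is why answer triggering is scored separately at the question level.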
#### Observations
* CNN-Cnt outperforms all other models on both tasks.
* Three additional features, namely the length of the question (QLen), the length of sentence (SLen), and the class of the question (QClass) are added to track question hardness and sentence comprehensiveness.
* Adding QLen improves performance significantly, adding SLen improves it marginally, and adding QClass degrades it marginally.
* For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
* Note: The dataset is too small to train end-to-end networks.