Summaries from Association for Computational Linguistics on ShortScience.org

aclweb.org
scholar.google.com

Bidirectional RNN for Medical Event Detection in Electronic Health Records
Jagannatha, Abhyuday N. and Yu, Hong
The Association for Computational Linguistics HLT-NAACL - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 6 years ago

The basic approach is an RNN applied to text to predict a medical event such as an ICD code. It is unclear if the complicated Bi-RNN model is required. 

This has some useful applications such as 
- Adapt old databases
- Correct errors
- Upgrade ICD versions

A simple diagram of an RNN applied to medical next is shown below:


https://i.imgur.com/NPExLqH.png

dx.doi.org
sci-hub
scholar.google.com

Tree-to-Sequence Attentional Neural Machine Translation
Akiko Eriguchi and Kazuma Hashimoto and Yoshimasa Tsuruoka
Association for Computational Linguistics - 2016 via Local CrossRef
Keywords:

[link] Summary by Tim Miller 6 years ago

This work extends sequence-to-sequence models for machine translation by using syntactic information on the source language side. This paper looks at the translation task where English is the source language, and Japanese is the target language. The dataset is the ASPEC corpus of scientific paper abstracts that seem to be in both English and Japanese? (See note below). The trees for the source (English) are generated by running the ENJU parser on the English data, resulting in binary trees, and only the bracketing information is used (no phrase category information).

Given that setup, the method is an extension of seq2seq translation models where they augment it with a Tree-LSTM to do the encoding of the source language. They deviate from a standard Tree-LSTM by running an LSTM across tokens first, and using the LSTM hidden states as the leaves of the tree instead of the token embeddings themselves. Once they have the encoding from the tree, it is concatenated with the standard encoding from an LSTM. At decoding time, the attention for output token $y_j$ is computed across all source tree nodes $i$, which includes $n$ input token nodes and $n-1$ phrasal nodes, as the similarity between the hidden state $s_j$ and the encoding at node $i$, then passed through softmax. Another deviation from standard practice (I believe) is that the hidden state calculations $s_j$ in the decoder are a function of the previous output token $y_{t-1}$, the previous time steps hidden state $s_{j-1}$ and the previous time step's attention-modulated hidden state $\tilde{s}_{j-1}$.

The authors introduce an additional trick for improving decoding performance when translating long sentences, since they say standard length normalization did not work. Their method is to compute a probability distribution over output length given input length, and use this to create an additional penalty term in their scoring function, as the log of the probability of the current output length given input length.

They evaluate using RIBES (not familiar) and BLEU scores, and show better performance than other NMT and SMT methods, and similar to the best performing (non-neural) tree to sequence model.

Implementation: They seem to have a custom implementation in C++ rather than using a DNN library. Their implementation takes one day to run one epoch of training on the full training set. They do not say how many epochs they train for.

Note on data: We have looked at this data a bit for a project I'm working on, and the English sentences look like translations from Japanese. A large proportion of the sentences are written in passive form with the structure "X was Yed" e.g.. "the data was processed, the cells were cultured." This looks to me like they translated subject-dropped Japanese sentences which would have the same word order, but are not actually passive! So that raises for me the question of how representative the source side inputs are of natural English.

aclweb.org
scholar.google.com

Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
Plank, Barbara and Søgaard, Anders and Goldberg, Yoav
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

Doing POS tagging using a bidirectional LSTM with word- and character-based embeddings. They add an extra component to the loss function – predicting a frequency class for each word, together with their POS tag. Results show that overall performance remains similar, but there’s an improvement in tagging accuracy for low-frequency words.

https://i.imgur.com/nwb8dOC.png

aclweb.org
scholar.google.com

Weakly Supervised Part-of-speech Tagging Using Eye-tracking Data
Barrett, Maria and Bingel, Joachim and Keller, Frank and Søgaard, Anders
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The paper explores the usefulness of eye tracking for the task of POS tagging. The assumption is that readers skip quickly over closed class words, and fixate longer on rare on ambiguous words.

The experiments are performed on unsupervised POS tagging – a second-order HMM uses constraints on possible tags for each word (based on a dictionary), but no explicit annotated data is required. They show that including the eye tracking features improves performance by quite a bit. Surprisingly, it seems to be better to average eye tracking features over all training tokens of the same type, as opposed to using using the data for each individual token, which means eye tracking is only used during the training stage.

aclweb.org
scholar.google.com

Literal and Metaphorical Senses in Compositional Distributional Semantic Models
Gutiérrez, E. Dario and Shutova, Ekaterina and Marghetis, Tyler and Bergen, Benjamin
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The paper investigates compositional semantic models specialised for metaphors.

https://i.imgur.com/OnoJK3h.png

They construct a dataset of 8592 adjective-noun phrases, covering 23 different adjectives, annotated for being metaphorical or literal. They then train compositional models to predict the phrase vector based on the noun vector, as a linear combination with an adjective-specific weight matrix. They show that it’s better to learn separate adjective matrices for literal and metaphorical uses of each adjective, even though the amount of training data is smaller.

aclweb.org
scholar.google.com

Named Entity Recognition for Novel Types by Transfer Learning
Qu, Lizhen and Ferraro, Gabriela and Zhou, Liyuan and Hou, Weiwei and Baldwin, Timothy
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors tackle the problem of domain adaptation for NER, where the label set of the target domain is different from the source domain.

They first train a CRF model on the source domain. Next, they train a LR classifier to predict labels in the target domain, based on predicted label scores from the model. Finally, the weights from the classifier are used to initialise another CRF model, which is then fine-tuned on the target domain data.

https://i.imgur.com/zwSB7qN.png

aclweb.org
scholar.google.com

Black Holes and White Rabbits: Metaphor Identification with Visual Features
Shutova, Ekaterina and Kiela, Douwe and Maillard, Jean
The Association for Computational Linguistics HLT-NAACL - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They build a system for detecting metaphors (“blind alley”, “honest meal”, etc) from literal word pairs.

https://i.imgur.com/Bv1gIb2.png

Annotated metaphor examples from Tsvetkov et al. (2014), used in this work.

The basic system uses word embedding similarity – cosine between the word embeddings. Then they explore variations using phrase embeddings, cos(phrase-word2, word2), which is similar to the operations with word regularities by Mikolov.

Finally, they create vector representations for words and phrases using visual information. The words are used as queries in Google Image Search, and the returned images are passed through an image detection network in order to obtain vector representations. The best final system performs the task separately using linguistic and visual vectors, and then combines the resulting scores.

aclweb.org
scholar.google.com

Numerically Grounded Language Models for Semantic Error Correction
Spithourakis, Georgios P. and Augenstein, Isabelle and Riedel, Sebastian
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They create an LSTM neural language model that 1) has better handling of numerical values, and 2) is conditioned on a knowledge base.

https://i.imgur.com/Rb6V1Hy.png

First the the numerical value each token is given as an additional signal to the network at each time step. While we normally represent token “25” as a normal word embedding, we now also have an extra feature with numerical value float(25). Second, they condition the language model on text in a knowledge base. All the information in the KB is converted to a string, passed through an LSTM and then used to condition the main LM.

They evaluate on a dataset of 16,003 clinical records which come paired with small KB tuples of 20 possible attributes. The numerical grounding helps quite a bit, and the best results are obtained when the KB conditioning is also added.

aclweb.org
scholar.google.com

Variational Neural Machine Translation
Zhang, Biao and Xiong, Deyi and Su, Jinsong and Duan, Hong and Zhang, Min
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.

https://i.imgur.com/6yIEbDf.png

The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are optimised to be similar using Kullback-Leibler distance, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.

aclweb.org
scholar.google.com

Extracting token-level signals of syntactic processing from fMRI - with an application to PoS induction
Bingel, Joachim and Barrett, Maria and Søgaard, Anders
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They incorporate fMRI features into POS tagging, under the assumption that reading semantically/functionally different words will activate the brain in different ways. For this they use a dataset of fMRI recordings, where the subjects were reading a chapter of Harry Potter. The main issue is that fMRI has very low temporal resolution – there is only one fMRI reading per 4 tokens, and in general it takes around 4-14 seconds for something to show up in fMRI. Nevertheless, they construct token-level vectors by using a Gaussian weighted average, integrate them into an unsupervised POS tagger, and show that it is able to improve performance.

https://i.imgur.com/TU60N6w.png

aclweb.org
scholar.google.com

Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps
Bulat, Luana and Kiela, Douwe and Clark, Stephen
The Association for Computational Linguistics HLT-NAACL - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The task is to predict feature norms – object properties, for example is_yellow and is_edible for the word banana. They experiment with adding in image recognition features, in addition to using distributional word vectors.

An input word is used to retrieve 10 images from Google, these are passed through an ImageNet classifier to get feature vectors, and then averaged to get a vector representation for that word. A supervised model (partial least-squares regression) is then trained to predict vectors of feature norms based on the input vectors (image-based, distributional, or a combination). Including the image information helps quite a bit, especially for detecting properties like colour and shape.

https://i.imgur.com/4TYKhvm.png

Examples of predicted feature norms using the visual features.

aclweb.org
scholar.google.com

On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
Su, Pei-Hao and Gasic, Milica and Mrksic, Nikola and Rojas-Barahona, Lina Maria and Ultes, Stefan and Vandyke, David and Wen, Tsung-Hsien and Young, Steve J.
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The goal is to improve the training process for a spoken dialogue system, more specifically a telephone-based system providing restaurant information for the Cambridge (UK) area. They train a supervised system which tries to predict the success on the current dialogue – if the model is certain about the outcome, the predicted label is used for training the dialogue system; if the model is uncertain, the user is asked to provide a label. Essentially it reduces the amount of annotation that is required, by choosing which examples should be annotated through active learning.

https://i.imgur.com/dWY1EdE.png

The dialogue is mapped to a vector representation using a bidirectional LSTM trained like an autoencoder, and a Gaussian Process is used for modelling dialogue success.

aclweb.org
scholar.google.com

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Chen, Danqi and Bolton, Jason and Manning, Christopher D.
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

Hermann et al (2015) created a dataset for testing reading comprehension by extracting summarised bullet points from CNN and Daily Mail. All the entities in the text are anonymised and the task is to place correct entities into empty slots based on the news article.

https://i.imgur.com/qeJATKq.png

This paper has hand-reviewed 100 samples from the dataset and concludes that around 25% of the questions are difficult or impossible to answer even for a human, mostly due to the anonymisation process. They present a simple classifier that achieves unexpectedly good results, and a neural network based on attention that beats all previous results by quite a margin.

aclweb.org
scholar.google.com

Character-based Neural Machine Translation
Costa-Jussà, Marta R. and Fonollosa, José A. R.
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Alexander Jung 7 years ago

  * Most neural machine translation models currently operate on word vectors or one hot vectors of words.
  * They instead generate the vector of each word on a character-level.
    * Thereby, the model can spot character-similarities between words and treat them in a similar way.
    * They do that only for the source language, not for the target language.

### How
  * They treat each word of the source text on its own.
  * To each word they then apply the model from [Character-aware neural language models](https://arxiv.org/abs/1508.06615), i.e. they do per word:
    * Embed each character into a 620-dimensional space.
    * Stack these vectors next to each other, resulting in a 2d-tensor in which each column is one of the vectors (i.e. shape `620xN` for `N` characters).
    * Apply convolutions of size `620xW` to that tensor, where a few different values are used for `W` (i.e. some convolutions cover few characters, some cover many characters).
    * Apply a tanh after these convolutions.
    * Apply a max-over-time to the results of the convolutions, i.e. for each convolution use only the maximum value.
    * Reshape to 1d-vector.
    * Apply two highway-layers.
    * They get 1024-dimensional vectors (one per word).
    * Visualization of their steps:
      * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Character-based_Neural_Machine_Translation__architecture.jpg?raw=true "Architecture")
  * Afterwards they apply the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) to these vectors, yielding a translation to a target language.
  * Whenever that translation yields an unknown target-language-word ("UNK"), they replace it with the respective (untranslated) word from the source text.

### Results
  * They the German-English [WMT](http://www.statmt.org/wmt15/translation-task.html) dataset.
  * BLEU improvemements (compared to neural translation without character-level words):
    * German-English improves by about 1.5 points.
    * English-German improves by about 3 points.
  * Reduction in the number of unknown target-language-words (same baseline again):
    * German-English goes down from about 1500 to about 1250.
    * English-German goes down from about 3150 to about 2650.
  * Translation examples (Phrase = phrase-based/non-neural translation, NN = non-character-based neural translation, CHAR = theirs):
    * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Character-based_Neural_Machine_Translation__examples.jpg?raw=true "Examples")

aclweb.org
scholar.google.com

Cross-lingual Models of Word Embeddings: An Empirical Comparison
Upadhyay, Shyam and Faruqui, Manaal and Dyer, Chris and Roth, Dan
Association for Computational Linguistics - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Tim Miller 7 years ago

This paper looks at four different ways of training cross-lingual embeddings and evaluates them on downstream tasks to investigate which types of learning are suitable for which tasks. Since they require different amounts and quality of supervision this is important to understand. If it only works when there are word-level alignments then cross-lingual embeddings won't help with low-resource languages. On the other hand, if methods trained by comparable corpora only are effective for some downstream tasks, then we have some hope for low resource languages.

The four methods are: 1) a skip-gram like method (Biskip) that uses word-aligned corpora, replacing words for aligned words in the target language, and predicting the source words in the context. This requires sentence aligned corpora, and I believe the word alignments are automatic given that. 2) A model that computes sentence embeddings from component word embeddings and optimizes the loss function to minimize the difference between aligned sentences. (BiCVM) This of course reqiures aligned sentences as well. 3) A projection-based method (BiCCA) that takes two monolingual (independent) word vectors and learns projections into a shared space. They use a translation dictinoary to find the subset of words that align, and use CCA to learn a mapping that respects those alignment. This projection can then be applied to all words in the dictionary. This method does not require similar corpora. 4) A method that uses comparable corpora (similar documents in each language, ala wikipedia) to create pseudo-documents that randomly samples words from each languages document. Once these documents are created they train with word2vec skipgram.

These methods are evaluated on a few NLP tasks, including monolingual word similarity, cross-lingual dictinoary induction, cross-lingual document classification, and cross-lingual dependency parsing.
Of the models that require sentence alignments, BiSkip usually beats BiCVM. BiSkip is best for most of the semantic tasks. BiCCA is as good or better for dependency parsing, suggesting that cheaper methods might be ok for syntactic tasks. One major caveat is that the Chinese dependency parsing results are terrible, meaning that this style of training for dependency parsers probably only works when the languages have similar structure. So the benefits to parsing low resource languages may be minimal even though the supervision required to create the embeddings is low.

arxiv.org
arxiv-vanity.com
scholar.google.com

SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Denny Britz 8 years ago

TLDR; A new dataset of ~100k questions and answers based on ~500 articles from Wikipedia. Both questions and answers were collected using crowdsourcing. Answers are of various types: 20% dates and numbers, 32% proper nouns, 31% noun phrase answers and 16% other phrases. Humans achieve an F1 score of 86%, and the proposed Logistic Regression model gets 51%. It does well on simple answers but struggles with more complex types of reasoning. Tge data set is publicly available at https://stanford-qa.com/.

#### Key Points

- System must select answers from all possible spans in a passage. $O(N^2)$ possibilities for N tokens in passage.
- Answers are ambiguous. Humans achieve 77% on exact match and 86% on F1 (overlap based). Humans would probably achieve close to 100% if the answer phrases were unambiguous.
- Lexicalized and dependency tree path features are most important for the LR model
- Model performs best on dates and numbers, single tokens, and categories with few possible candidates

arxiv.org
arxiv-vanity.com
scholar.google.com

Neural Machine Translation with Recurrent Attention Modeling
Zichao Yang and Zhiting Hu and Yuntian Deng and Chris Dyer and Alex Smola
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.NE, cs.CL
more

[link] Summary by Denny Britz 8 years ago

TLDR; The standard attention model does not take into account the "history" of attention activations, even though this should be a good predictor of what to attend to next. The authors augment a seq2seq network with a dynamic memory that, for each input, keep track of an attention matrix over time. The model is evaluated on English-German and Englih-Chinese NMT tasks and beats competing models.

#### Notes

- How expensive is this, and how much more difficult are these networks to train?
- Sequentiallly attending to neighboring words makes sense for some language pairs, but for others it doesn't. This method seems rather restricted because it only takes into account a window of k time steps.

arxiv.org
arxiv-vanity.com
scholar.google.com

Learning to Translate in Real-time with Neural Machine Translation
Jiatao Gu and Graham Neubig and Kyunghyun Cho and Victor O. K. Li
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG
more

[link] Summary by Denny Britz 8 years ago

The authors propose a framework where a Reinforcement Learning agents makes decisions of reading the next input words or producing the next output word to trade off translation quality and time delay (caused by read operations). The reward function is based on both quality (BLEU score) and delay (various metrics and hyperparameters). The authors use Policy Gradient to optimize the model, which is initialized from a pre-trained translation model. They apply to approach to WMT'15 EN-DE and EN-RU translation and show that the model increases translation quality in all settings and is able to trade off effectively between quality and delay.

arxiv.org
arxiv-vanity.com
scholar.google.com

Natural Language Comprehension with the EpiReader
Adam Trischler and Zheng Ye and Xingdi Yuan and Kaheer Suleman
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Denny Britz 8 years ago

TLDR; The authors prorpose the "EpiReader" model for Question Answering / Machine Comprehension. The model consists of two modules: An Extractor that selects answer candidates (single words) using a Pointer network, and a Reasoner that rank these candidates by estimating textual entailment. The model is trained end-to-end and works on cloze-style questions. The authors evaluate the model on CBT and CNN datasets where they beat Attention Sum Reader and MemNN architectures.


#### Notes

- In most architectures, the correct answer is among the top5 candidates 95% of the time.
- Soft Attention is a problem in many architectures. Need a way to do hard attention.

arxiv.org
arxiv-vanity.com
scholar.google.com

Incorporating Copying Mechanism in Sequence-to-Sequence Learning
Jiatao Gu and Zhengdong Lu and Hang Li and Victor O. K. Li
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI, cs.LG, cs.NE
more

[link] Summary by Denny Britz 8 years ago

TLDR; The authors introduce CopyNet, a variation on the seq2seq that incorporates a "copying mechanism". With this mechanism, the effective vocabulary is the union of the standard vocab and the words in the current source sentence. CopyNet predicts words based on a mixed probability of the standard attention mechanism and a new copy mechanism. The authors show empirically that on toy and summarization tasks CopNet behaves as expected: The decoder is dominated by copy mode when it tries to replicate something from the source.

arxiv.org
arxiv-vanity.com
scholar.google.com

Neural Tree Indexers for Text Understanding
Tsendsuren Munkhdalai and Hong Yu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, stat.ML
more

[link] Summary by mrdrozdov 8 years ago

### General Approach

The Neural Tree Indexer (NTI) approach succeeded to reach 87.3\% test accuracy on SNLI. Here I'll attempt to clearly describe the steps involved based on the publication [1] and open sourced codebase [2].

NTI is a method to apply attention over a tree, specifically applied to sentence pairs. There are three main steps, each giving an incrementally more expressive representation of the input. It's worth noting that the tree is a full binary tree, so sentence lengths are padded to a factor of 2. In this case, the padded length used is $2^5 = 32$.

- **Sequence Encoding.** Run an RNN over your sentence to get new hidden states for each element.

$$h_t = f_1^{rnn}(i_t, h_{t-1})$$
    
- **Tree Encoding.** Using the hidden states from the previous step, use a variant of TreeLSTM to combine leaves until you have a single hidden state representing the entire sentence. Keep all of the intermediary hidden states for the next step.
    
$$ h_t^{tree} = f^{tree}(h_l^{tree},h_r^{tree})$$
    
- **Attention on Opposite Tree.** Until now we've only been describing how to encode a single sentence. When incorporating attention, we attend on the opposite tree by using the hidden states from the previous step. For instance, here is how we'd encode the premise (where the $p,h$ superscripts denote the premise or hypothesis, and $\vec{h}^{h,tree}$ denotes all of the hidden states of the non-attended hypothesis tree.:
    
$$h_t^p = f_1^{rnn}(i_t^p, h_{t-1}^p) \\
    h_t^{p,tree} = f^{tree}(h_l^{p,tree},h_r^{p,tree}) \\
    i_t^{p,attn} = f^{attn}(h_t^{p,tree}, \vec{h}^{h,tree}) \\
    h_t^{p,attn} = f_2^{rnn}(i_t^{p,attn}, h_{t-1}^{p,attn})
$$

### Datasets

NTI was evaluated on three datasets. Some variant of the model achieved state-of-the-art in some category for each dataset:

- SNLI [3]: Sentence Pair Classification.
- WikiQA [4]: Answer Sentence Selection.
- Stanford Sentiment TreeBank (SST) [5]: Sentence Classification.

### Implementation Details

- Batch size is $32$ pairs (so $32$ of each premise and hypothesis).
- Tree is full binary tree with $2^5 = 32$ leaves.
- All sentences are padded left to length $32$, matching the full binary tree.
- Steps 1 (sentence encoding) runs on all sentence simultaneously. So is Step 2 (tree encoding). Step 3 (attention) is done first on the premise, then on the hypothesis.
- The variant of TreeLSTM used is S-LSTM. It's available as a standard function in Chainer.
- Dropout is applied liberally in each step. The keep rate is fixed at $80\%$.
- MLP has $1$ hidden layer with dimension $1024$. Dimensions of the entire MLP are: $(2 \times H) \times 1024 \times 3$. $H$ is the size of the hidden states and is $300$.
- Uses Chainer's Adam optimizer with $\alpha=0.0003,\beta_1=0.9,\beta_2=0.999,\epsilon=10^{-8}$. Gradient clipping using L2 norm of $40$. Parameters periodically scaled by $0.00003$ (weight decay).
- Weights are initialized uniformly random between $-0.1$ and $0.1$.

[1]: https://arxiv.org/abs/1607.04492
[2]: https://bitbucket.org/tsendeemts/nti/overview
[3]: nlp.stanford.edu/projects/snli/
[4]: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/
[5]: http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces

arxiv.org
arxiv-vanity.com
scholar.google.com

Bag of Tricks for Efficient Text Classification
Armand Joulin and Edouard Grave and Piotr Bojanowski and Tomas Mikolov
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* Introduces fastText, a simple and highly efficient approach for text classification.
* At par with deep learning models in terms of accuracy though an order of magnitude faster in performance. 
* [Link to the paper](http://arxiv.org/abs/1607.01759v3)
* [Link to code](https://github.com/facebookresearch/fastText)

#### Architecture

* Built on top of linear models with a rank constraint and a fast loss approximation.
* Start with word representations that are averaged into text representation and feed them to a linear classifier.
* Think of text representation as a hidden state that can be shared among features and classes.
* Softmax layer to obtain a probability distribution over pre-defined classes.
* High computational complexity $O(kh)$, $k$ is the number of classes and $h$ is dimension of text representation.

##### Hierarchial Softmax

* Based on Huffman Coding Tree
* Used to reduce complexity to $O(hlog(k))$
* Top T results (from the tree) can be computed efficiently $O(logT)$ using a binary heap.

##### N-gram Features

* Instead of explicitly using word order, uses a bag of n-grams to maintain efficiency without losing on accuracy.
* Uses [hashing trick](https://arxiv.org/pdf/0902.2206.pdf) to maintain fast and memory efficient mapping of the n-grams.

#### Experiments

##### Sentiment Analysis

* fastText benefits by using bigrams.
* Outperforms [char-CNN](http://arxiv.org/abs/1502.01710v5) and [char-CRNN](http://arxiv.org/abs/1602.00367v1) and performs a bit worse than [VDCNN](http://arxiv.org/abs/1606.01781v1).
* Order of magnitudes faster in terms of training time.
* Note: fastText does not use pre-trained word embeddings.

##### Tag Prediction

* fastText with bigrams outperforms [Tagspace](http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf).
* fastText performs upto 600 times faster at test time.

arxiv.org
arxiv-vanity.com
scholar.google.com

Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge
Nicholas Locascio and Karthik Narasimhan and Eduardo DeLeon and Nate Kushman and Regina Barzilay
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI
more

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* Task of translating natural language queries into regular expressions without using domain specific knowledge.
* Proposes a methodology for collecting a large corpus of regular expressions to natural language pairs.
* Reports performance gain of 19.6% over state-of-the-art models.
* [Link to the paper](http://arxiv.org/abs/1608.03000v1)

#### Architecture

* LSTM based sequence to sequence neural network (with attention)
* Six layers
    * One-word embedding layer
    * Two encoder layers
    * Two decoder layers
    * One dense output layer.
* Attention over encoder layer.
* Dropout with the probability of 0.25.
* 20 epochs, minibatch size of 32 and learning rate of 1 (with decay rate of 0.5)

#### Dataset Generation

* Created a public dataset - **NL-RX** - with 10K pair of (regular expression, natural language) 
* Two step generate-and-paraphrase approach
    * Generate step
        * Use handcrafted grammar to translate regular expressions to natural language.
    * Paraphrase step
        * Crowdsourcing the task of translating the rigid descriptions into more natural expressions.


#### Results

* Evaluation Metric
    * Functional equality check (called DFA-Equal) as same regular expression could be written in many ways.
* Proposed architecture outperforms both the baselines - Nearest Neighbor classifier using Bag of Words (BoWNN) and Semantic-Unify

arxiv.org
arxiv-vanity.com
scholar.google.com

WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia
Daniel Hewlett and Alexandre Lacoste and Llion Jones and Illia Polosukhin and Andrew Fandrianto and Jay Han and Matthew Kelcey and David Berthelot
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* Large scale natural language understanding task - predict text values given a knowledge base.
* Accompanied by a large dataset generated using Wikipedia
* [Link to the paper](http://www.aclweb.org/anthology/P/P16/P16-1145.pdf)

#### Dataset

* WikiReading dataset built using Wikidata and Wikipedia.
* Wikidata consists of statements of the form (property, value) about different items
* 80M statements, 16M items and 884 properties.
* These statements are grouped by items to get (item, property, answer) tuples where the answer is a set of values.
* Items are further replaced by their Wikipedia documents to generate 18.58M statements of the form (document, property, answer).
* Task is to predict answer given document and property.
* Properties are divided into 2 classes:
* **Categorical properties** - properties with a small number of possible answers. Eg gender.
* **Relational properties** - properties with unique answers. Eg date of birth.
* This classification is done on the basis of the entropy of answer distribution.
* Properties with entropy less than 0.7 are classified as categorical properties.
* Answer distribution has a small number of very high-frequency answers (head) and a large number of answers with very small frequency (tail).
* 30% of the answers do not appear in the training set and must be inferred from the document.

#### Models

##### Answer Classification

* Consider WikiReading as classification task and treat each answer as a class label.

###### Baseline

* Linear model over Bag of Words (BoW) features.
* Two BoW vectors computed - one for the document and other for the property. These are concatenated into a single feature vector.

###### Neural Networks Method

* Encode property and document into a joint representation which is fed into a softmax layer.

* **Average Embeddings BoW**

* Average the BoW embeddings for documents and property and concatenate to get joint representation.

* **Paragraph Vectors**

* As a variant of the previous method, encode document as a paragraph vector.

* **LSTM Reader**

* LSTM reads the property and document sequence, word-by-word, and uses the final state as joint representation.

* **Attentive Reader**

* Use attention mechanism to focus on relevant parts of the document for a given property.

* **Memory Networks**

* Maps a property p and list of sentences x1, x2, ...xn in a joint representation by attention over the sentences in the document.

##### Answer Extraction

* For relational properties, it makes more sense to model the problem as information extraction than classification.

* **RNNLabeler**

* Use an RNN to read the sequence of words and estimate if a given word is part of the answer.

* **Basic SeqToSeq (Sequence to Sequence)**

* Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words.

* **Placeholder SeqToSeq**

* Extends Basic SeqToSeq to handle OOV (Out of Vocabulary) words by adding placeholders to the vocabulary.
* OOV words in the document and answer are replaced by placeholders so that input and output sentences are a mixture of words and placeholders only.

* **Basic Character SeqToSeq**

* Property encoder RNN reads the property, character-by-character and transforms it into a fixed length vector.
* This becomes the initial hidden state for the second layer of a 2-layer document encoder RNN.
* Final state of this RNN is used by answer decoder RNN to generate answer as a character sequence.

* **Character SeqToSeq with pretraining**

* Train a character-level language model on input character sequence from the training set and use the weights to initiate the first layer of encoder and decoder.

#### Experiments

* Evaluation metric is F1 score (harmonic mean of precision and accuracy).
* All models perform well on categorical properties with neural models outperforming others.
* In the case of relational properties, SeqToSeq models have a clear edge.
* SeqToSeq models also show a great deal of balance between relational and categorical properties.
* Language model pretraining enhances the performance of character SeqToSeq approach.
* Results demonstrate that end-to-end SeqToSeq models are most promising for WikiReading like tasks.

arxiv.org
arxiv-vanity.com
scholar.google.com

Key-Value Memory Networks for Directly Reading Documents
Alexander Miller and Adam Fisch and Jesse Dodge and Amir-Hossein Karimi and Antoine Bordes and Jason Weston
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Shagun Sodhani 8 years ago

### Introduction

* Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to fixed schema) and too sparse (due to limitations of Information Extraction (IE) systems).
* The paper proposes Key-Value Memory Networks, a neural network architecture based on [Memory Networks](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e) that can leverage both KBs and raw data for QA.
* The paper also introduces MOVIEQA, a new QA dataset that can be answered by a perfect KB, by Wikipedia pages and by an imperfect KB obtained using IE techniques thereby allowing a comparison between systems using any of the three sources.
* [Link to the paper](https://arxiv.org/abs/1606.03126).

### Related Work

* TRECQA and WIKIQA are two benchmarks where systems need to select the sentence containing the correct answer, given a question and a list of candidate sentences. 
* These datasets are small and make it difficult to compare the  systems using different sources.
* Best results on these benchmarks are reported by CNNs and RNNs with attention mechanism.

### Key-Value Memory Networks

* Extension of [Memory Networks Model](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e).
* Generalises the way context is stored in memory.
* Comprises of a memory made of slots in the form of pair of vectors $(k_{1}, v_{1})...(k_{m}, v_{m})$ to encode long-term and short-term context.

#### Reading the Memory

* **Key Hashing** - Question, *x* is used to preselect a subset of array $(k_{h1}, v_{h1})...(k_{hN}, v_{hN})$ where the key shares atleast one word with *x* and frequency of the words is less than 1000.
* **Key Addresing** - Each candidate memory is assigned a relevance probability:
    * $p_hi$ = softmax($Aφ_X(x).Aφ_K (k_{hi}))$
    * φ is a feature map of dimension *D* and *A* is a *dxD* matrix.
* **Value Reading** - Value of memories are read by taking their weighted sum using addressing probabilites and a vector *o* is returned.
* $o = sum(p_{hi} A\phi_V(v_{hi}))$
* Memory access process conducted by "controller" neural network using $q = Aφ_X (x)$ as the query.
* Query is updated using
    * $q_2 = R_1 (q+o)$
* Addressing and reading steps are repeated using new $R_i$ matrices to retrieve more pertinent information in subsequent access.
* After a fixed number of hops, H, resulting state of controller is used to compute a final prediction.
* $a = \text{argmax}(\text{softmax}(q_{H+1}^T B\phi_Y (y_i)))$
where $y_i$ are the possible candidate outputs and $B$ is a $dXD$ matrix.
* The network is trained end-to-end using a cross entropy loss, backpropogation and stochastic gradient.
* End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks by setting key and value to be the same for all the memories.

#### Variants of Key-Value Memories

* $φ_x$ and $φ_y$ - feature map corresponding to query and answer are fixed as bag-of-words representation.

##### KB Triple

* Triplets of the form "subject relation object" can be represented in Key-Value Memory Networks with subject and relation as the key and object as the value.
* In standard Memory Networks, the whole triplet would have to be embedded in the same memory slot.
* The reversed relations "object is_related_to subject" can also be stored.

##### Sentence Level

* A document can be split into sentences with each sentence encoded in the key-value pair of the memory slot as a bag-of-words.

##### Window Level

* Split the document in the windows of W words and represent it as bag-of-words. 
* The window becomes the key and the central word becomes the value.

##### Window + Centre Encoding

* Instead of mixing the window centre with the rest of the words, double the size of the dictionary and encode the centre of the window and the value using the second dictionary.

##### Window + Title

* Since the title of the document could contain useful information, the word window can be encoded as the key and document title as the value.
* The key could be augmented with features like "_window_" and "_title_" to distinguish between different cases.

### MOVIEQA Benchmark

#### Knowledge Representation

* Doc - Raw documents (from Wikipedia) related to movies.
* KB - Graph-based KB made of entities and relations.
* IE - Performing Information Extraction on Wikipedia to create a KB.
* The QA pairs should be answerable by both raw document and KB so that the three approaches can be compared and the gap between the three solutions can be closed.
* The dataset has more than 100000 QA pairs, making it much larger than most existing datasets.

### Experiments

#### MOVIEQA

##### Systems Compared

* [Bordes et al's QA system](TBD)
* [Supervised Embeddings](TBD)(without KB)
* [Memory Networks](TBD)
* Key-Value Memory Networks

##### Observations

* Key-Value Memory Networks outperforms all methods on all data sources.
* KB > Doc > IE
* The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title".

##### KB vs Synthetic Document Analysis

* Given KB triplets, construct synthetic "Wikipedia" articles using templates, conjunctions and coreferences to determine the causes for the gap in performance when using KB vs doc.
* Loss in One Template sentences are due to the difficulty of extracting subject, relation and object from the artificial docs. 
* Using multiple templates does not deteriorate performance much. But conjunctions and coreferences cause a dip in performance.

#### WIKIQA

* Given a question, select the sentence (from Wikipedia document) that best answers the question.
* Key-Value Memory Networks outperforms all other solutions though it is only marginally better than LDC and attentive models based on CNNs and RNNs.

arxiv.org
arxiv-vanity.com
scholar.google.com

Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Antonio Valerio Miceli Barone
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE
more

[link] Summary by Jon Gauthier 8 years ago

This is a simple unsupervised method for learning word-level translation
between embeddings of two different languages.

That's right -- unsupervised.

The basic motivating hypothesis is that there should be an isomorphism between
the "semantic spaces" of different languages:

> we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.

If you squint a bit, you can make the more aggressive claim from this premise
that there should be a nonlinear / MLP mapping between *word embedding spaces*
that gets us the same result.

The author uses the adversarial autoencoder (AAE, from Makhzani last year)
framework in order to enforce a cross-lingual semantic mapping in word
embedding spaces. The basic setup for adversarial training between a source and
a target language:

1. Sample a batch of words from the source language according to the language's
word frequency distribution.
2. Sample a batch of words from the target language according to its word
frequency distribution. (No sort of relationship is enforced between the two
samples here.)
3. Feed the word embeddings corresponding to the source words through an
*encoder* MLP. This corresponds to the standard "generator" in a GAN setup.
4. Pass the generator output to a *discriminator* MLP along with the
target-language word embeddings.
5. Also pass the generator output to a *decoder* which maps back to the source
embedding distribution.
6. Update weights based on a combination of GAN loss + reconstruction loss.

### Does it work?

We don't really know. The paper is unfortunately short on evaluation --- we
just see a few examples of success and failure on a trained model. One easy
evaluation would be to compare accuracy in lexical mapping vs. corpus frequency
of the source word. I would bet that this would reveal the model hasn't done
much more than learn to align a small set of high-frequency words.

arxiv.org
arxiv-vanity.com
scholar.google.com

Sequence-to-Sequence Learning as Beam-Search Optimization
Sam Wiseman and Alexander M. Rush
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE, stat.ML
more

1	[link] Summary by Udibr 8 years ago This paper is covered by author in this [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20-%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf) more less

arxiv.org
arxiv-vanity.com
scholar.google.com

Globally Normalized Transition-Based Neural Networks
Daniel Andor and Chris Alberti and David Weiss and Aliaksei Severyn and Alessandro Presta and Kuzman Ganchev and Slav Petrov and Michael Collins
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE
more

[link] Summary by Udibr 8 years ago

[Parsey McParseface](http://github.com/tensorflow/models/tree/master/syntaxnet) is  a parser of English sentences capable of finding parts of speech and dependency parsing. By Michael Collins and google NY.

This paper is more than just about google's data collection and computing powers. The parser uses a feed forward NN, which is much faster than the RNN usually used for parsing. Also the paper is using a global method to solve the label bias problem. This method can be used for many tasks and indeed in the paper it is used also to shorten sentences by throwing unnecessary words.

The label bias problem arises when predicting each label in a sequence using a softmax over all possible label values in each step. This is a local approach but what we are really interested in is a global approach in which the sequence of all labels that appeared in a training example are normalized by all possible sequences. This is intractable so instead a beam search is performed to generate alternative sequences to the training sequence. The search is stopped when the training sequence drops from the beam or ends. The different beams with the training sequence are then used to compute the global loss. 

Similar method is used in [seq2seq by Sasha Rush](http://arxiv.org/pdf/1606.02960.pdf) and  [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20-%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf)

arxiv.org
scholar.google.com

Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
Firat, Orhan and Cho, KyungHyun and Bengio, Yoshua
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train a *single* Neural Machine Translation model that can translate between N*M language pairs, with a parameter spaces that grows linearly with the number of languages. The model uses a single attention mechanism shared across encoders/decoders. The authors demonstrate the the model performs particularly well for resource-constrained languages, outperforming single-pair models trained on the same data.

#### Key Points

- Attention mechanism: Both encoder and decoder output attention-specific vectors, which are then combined. Thus, adding a new source/target language does not result in a quadratic explosion of parameters.
- Bidirectional RNN, 620-dimensional embeddings, GRU with 1k units, 1k affine layer tanh. Adam, minibatch 60 examples. Only use sentence up to length 50.
- Model clearly outperforms single-pair models when parallel corpora are constrained to small size. Not so much for large corpora.
- The single model doesn't fit on a GPU.
- Can in theory be used to translate between pairs that didn't have a bilingual training corpus, but the authors don't evaluate this in the paper.
- Main difference to "Multi-task Sequence to Sequence Learning": Uses attention mechanism

#### Notes / Questions

- I don't see anything that would force the encoders to map sequences of different languages into the same representation (as the authors briefly mentioned). Perhaps it just encodes language-specific information that the decoders can use to decide which source language it was?

arxiv.org
scholar.google.com

Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Luong, Minh-Thang and Manning, Christopher D.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train a word-level NMT where UNK tokens in both source and target sentence are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalized that doesn't produce unknown words. The best system achieves a new state of the art BLEU score of 19.9 in WMT'15 English to Czech translation.


#### Key Points

- Source Sentence: Final hidden state of character-RNN is used as word representation.
- Source Sentence: Character RNNs always initialized with 0 state to allow efficient pre-training 
- Target: Produce word-level sentence including UNK first and then run the char-RNNs
- Target: Two ways to initialize char-RNN: With same hidden state as word-RNN (same-path), or with its own representation (separate-path)
- Authors find that attention mechanism is critical for pure character-based NMT models


#### Notes

- Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win?

arxiv.org
scholar.google.com

Latent Predictor Networks for Code Generation
Ling, Wang and Grefenstette, Edward and Hermann, Karl Moritz and Kociský, Tomás and Senior, Andrew and Wang, Fumin and Blunsom, Phil
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a conditional generative model of text, where text can be generated either one character at a time or by copying some full chunks of character taken directly from the input into the output. At each step of the generation, the model can decide which of these two modes of generation to use, mixing them as needed to generate a correct output. They refer to this structure for generation as Latent Predictor Networks \cite{conf/nips/VinyalsFJ15}. The character-level generation part of the model is based on a simple output softmax over characters, while the generation-by-copy component is based on a Pointer Network architecture. Critically, the authors highlight that it is possible to marginalize over the use of either types of components by dynamic programming as used in semi-Markov models \cite{conf/nips/SarawagiC04}.

One motivating application is machine translation, where the input might contain some named entities that should just be directly copied at the output. However, the authors experiment on a different problem, that of generating code that would implement the action of a card in the trading card games Magic the Gathering and Hearthstone. In this application, copying is useful to do things such as copy the name of the card or its numerically-valued effects.

In addition to the Latent Predictor Network structure, the proposed model for this application includes a slightly adapted form of soft-attention as well as character-aware word embeddings as in \cite{conf/emnlp/LingDBTFAML15} Also, the authors experiment with a compression procedure on the target programs, that can help in reducing the size of the output space.

Experiments show that the proposed neural network approach outperforms a variety of strong baselines (including systems based on machine translation or information retrieval).