Deep contextualized word representations on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Deep contextualized word representations
Matthew E. Peters and Mark Neumann and Mohit Iyyer and Matt Gardner and Christopher Clark and Kenton Lee and Luke Zettlemoyer
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL
more

Summaries/Notes 2

[link] Summary by mnoukhov 7 years ago

This paper introduces a deep universal word embedding based on using a bidirectional LM (in this case, biLSTM). First words are embedded with a CNN-based, character-level, context-free, token embedding into $x_k^{LM}$ and then each sentence is parsed using a biLSTM, maximizing the log-likelihood of a word given it's forward and backward context (much like a normal language model). 

The innovation is in taking the output of each layer of the LSTM ($h_{k,j}^{LM}$ being the output at layer $j$) 
$$
\begin{align}
R_k &= \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1 \ldots L \} \\
&= \{h_{k,j}^{LM} | j = 0 \ldots L \}
\end{align}
$$
and allowing the user to learn a their own task-specific weighted sum of these hidden states as the embedding:
$$
ELMo_k^{task} = \gamma^{task} \sum_{j=0}^L s_j^{task} h_{k,j}^{LM}
$$

The authors show that this weighted sum is better than taking only the top LSTM output (as in their previous work or in CoVe) because it allows capturing syntactic information in the lower layer of the LSTM and semantic information in the higher level. Table below shows that the second layer is more useful for the semantic task of word sense disambiguation, and the first layer is more useful for the syntactic task of POS tagging.

https://i.imgur.com/dKnyvAa.png

On other benchmarks, they show it is also better than taking the average of the layers (which could be done by setting $\gamma = 1$)
https://i.imgur.com/f78gmKu.png

To add the embeddings to your supervised model,  ELMo is concatenated with your context-free embeddings, $\[ x_k; ELMo_k^{task} \]$. It can also be concatenated with the output of your RNN model $\[ h_k; ELMo_k^{task} \]$ which can show improvements on the same benchmarks

https://i.imgur.com/eBqLe8G.png

Finally, they show that adding ELMo to a competitive but simple baseline gets SOTA (at the time) on very many NLP benchmarks

https://i.imgur.com/PFUlgh3.png

It's all open-source and there's a tutorial [here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md)

Your comment:

[link] Summary by CodyWild 7 years ago

This paper builds on the paper "Learned in Translation: Contextualized Word Vectors" , which learned contextualized word representations by using the sequence of encodings generated by a Bidirectional LSTM as the representation of the sequence of input words. This paper says “if we're learning a deep LSTM, ie one with more than one layer, why should we use only the last layer that it produces as the representation of the word?”. This paper instead suggests that it could be valuable for transfer learning if each task can learn a weighting of layer encodings that is most valuable for that task. In a prime example of “your model is a special case of my model,” they note that this framework can easily learn the approach of only using the final encoding layer by just only giving that layer a non-zero weight. As intuition for why this might be a valuable thing to do: different layers tend to capture different levels of meaning, with lower layers more likely to capture part of speech information, and higher layers more likely to capture more rich semantic context. 

https://i.imgur.com/s8Qn6YY.png

One difficulty in comparing this paper directly to the “take the top layer encoding from a LSTM” paper is that they were trained on different problems: the top layer paper learned using a machine translation objective, where, by contrast, this one learns by using a much simpler language model. Here, a simple language model means a RNN that is trained to predict the next word, given the hidden state built up over all prior words. Because we want to pull in word context from both directions, this is’t just a LSTM but a bidirectional LSTM, which - surprise surprise - tries to pick the word *before* a word in a sentence by using all of the words that come after it. This has the advantage of not requiring parallel data, the way machine translation does, but also makes it difficult to make direct comparisons to prior work that isolate the effect of multi-layer combination, as separate from the switch between machine translation and direct language modeling. 

Although this is likely also a benefit you see with just top-layer contextual vectors, it is interesting to examine the attached table, and look at how effectively the model is able to learn different representations of the word “play” depending on the context in which it appears; in each case, the nearest neighbor of a context of “play” is a sentence in which the word is used in the same context.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private