You May Not Need Attention on ShortScience.org


You May Not Need Attention
Ofir Press and Noah A. Smith
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL


Summary by CodyWild 5 years ago

I admit it - the title of the paper pulled me in, existing as it does in the chain of weirdly insider-meme papers, starting with Vaswani et al.’s 2017 “Attention Is All You Need”. That paper has been hugely influential, and the domain of machine translation as a whole has begun to move away from processing (or encoding) source sentences with recurrent architectures, and towards processing them with self-attention architectures. (Self-attention is a little too nuanced to go into in full depth here, but the basic idea is: instead of summarizing varying-length sequences by feeding each timestep into a recurrent loop and building up hidden states, generate a query, and weight the contribution of each timestep to each “hidden state” based on the dot product between that query and each timestep’s representation.) There has been an overall move in recent years away from recurrence being the accepted default for sequence data, and towards attention and (often dilated) convolution taking up more space. I find this an interesting set of developments, and had hoped that this paper would address that arc.

Unfortunately, the title was quite out of sync with the actual focus of the paper. Instead of addressing the contribution of attention mechanisms vs recurrence, or even directly addressing any of the particular ideas posed in the “Attention Is All You Need” paper, “You May Not Need Attention” (YMNNA from here on) takes aim at a more fundamental structural feature of translation models: the encoder/decoder structure. The basic idea of an encoder/decoder approach, in a translation paradigm, is that you process the entire source sentence before you start generating the tokens of the predicted, other-language target sentence. Initially, this would work by running an RNN over the full sentence, and using the final hidden state of that RNN as a compressed representation of the full sentence. More recently, the norm has been to use multiple RNN layers, and to represent the source sentence via the hidden states at each timestep (so: as many hidden states as you have input tokens), and then at each step in the decoding process, calculate an attention-weighted average over all of those hidden states. But, fundamentally, both of these structures share the fact that some kind of global representation is calculated and made available to the decoder before it starts predicting words in the output sentence.
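
As a rough illustration of the attention-weighted average described above, here is a minimal sketch in plain Python. It assumes dot-product scoring for simplicity (attention-based RNN translation models often use a small learned scoring network instead), and the function name `attention_context` and the toy vectors are made up for this example.

```python
import math
from typing import List, Tuple


def attention_context(
    query: List[float], hidden_states: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for one decoding step.

    Assumption: scores are plain dot products between the decoder query
    and each encoder hidden state, normalised with a softmax.
    """
    # One score per source timestep.
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the hidden states, one value per dimension.
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Toy usage: three source timesteps, two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights, context)
```

Note that the decoder needs every entry of `encoder_states` at every step, which is the memory cost the eager model avoids.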

This makes sense for a few reasons. First, and most obviously, languages aren’t naturally aligned with one another, in the sense of one word in language X corresponding to one word in language Y. It’s not possible for you to predict a word in the target sentence if its corresponding source sentence token has not yet been processed. Second, there can be contextual information from the sentence as a whole that can disambiguate between different senses of a word, which may have different translations - think Teddy Bear vs Teddy Roosevelt. However, this paper poses the question: how well can you do if you throw away this structure, and build a model that continually emits tokens of the target sequence as it reads in the source sentence? Using a recurrent model, the YMNNA model takes, at each timestep, the new source token, the previous target token, and the RNN’s hidden state from the previous timestep, and uses them to predict a token.

However, the problem mentioned earlier - of languages not natively being aligned such that you have the necessary information to predict a word by the time you get to its point in the target sequence - hasn’t gone away. This paper solves it in a pretty unsatisfying way - by relying on an external tool, fast_align, that does the work of guessing which source tokens correspond to which target tokens, and inserting buffer tokens into the target, so that you don’t need to predict a word until its aligned source word has been read by the RNN; until then you just predict the buffer. This is fine and clever as a practical heuristic, but it really does make their comparisons against models that do alignment and translation jointly feel a little weak.

https://i.imgur.com/Gitpxi7.png

An additional heuristic that makes the overall narrative of the paper less compelling is the fact that, in order to get comparable performance to their baselines, they padded the start of the target sequences with between 3 and 5 buffer tokens, meaning that the models learned that they could process the first 3-5 tokens of the sentence before they needed to start emitting the target. Again, there’s nothing necessarily wrong with this, but, since they are consuming a portion of the sentence before they start emitting translations, it does make for a less stark comparison with the “read the whole sentence” encoder/decoder framework.
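
To picture that delay, here is a small sketch in Python. It is not the paper's preprocessing code: the names `delay_target`, `PAD` and `EOS` are hypothetical, and it assumes the shorter of the two sequences is simply filled out at the end so that source and target have one token per timestep.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical buffer token emitted while the model is still reading
EOS = "<eos>"  # hypothetical filler fed once the source has run out


def delay_target(
    source: List[str], target: List[str], k: int
) -> Tuple[List[str], List[str]]:
    """Shift the target right by k buffer tokens and equalise lengths.

    Assumption: the source is extended with EOS and the target with PAD
    so both sequences end up the same length.
    """
    delayed = [PAD] * k + list(target)
    length = max(len(source), len(delayed))
    padded_source = list(source) + [EOS] * (length - len(source))
    padded_target = delayed + [PAD] * (length - len(delayed))
    return padded_source, padded_target


src, tgt = delay_target(["The", "white", "dog"], ["El", "perro", "blanco"], k=3)
print(src)
print(tgt)
```

With `k=3` on a three-word sentence, the model has read the entire source before emitting its first real word, which is why large delays blur the contrast with a standard encoder/decoder.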

A few other frustrations, and notes from the paper’s results section:

- As mentioned earlier, the authors don’t actually compare their work against the “Attention Is All You Need” paper, but instead against a 2014 paper. This is confusing both in terms of using an old baseline for SOTA, and also in terms of their title implicitly arguing they are refuting a paper they didn’t compare to.
- Compared against that old baseline, their eager translation model performs worse on all sentences shorter than 60 tokens (which make up the vast majority of sentences), and only beats the baseline on sentences longer than 60 tokens.
- Additionally, they note as a sort of throwaway line that their model took almost three times as long to train as the baseline, with the same number of parameters, simply because it took so much longer to converge.

Being charitable, it seems like there is some argument that an eager translation framework performs well on long sentences, and can do so while keeping only a single hidden state in memory, rather than having to keep the hidden states for every source sequence element around, as attention-based decoders require. However, overall, I found this paper to be a frustrating let-down, one that used too many heuristics and hacks to be a compelling comparison to prior work.

Not all research advances are made with state of the art models. Sometimes new methods are introduced that are slow, parameter-heavy or have some other deficiency. Such ideas are not meant to be introduced into production servers; they are meant to spark a discussion, which could then lead the research community to discover new ideas which will one day be used to improve state of the art models.

This paper does not try to present a state of the art model. This paper was written to question two commonly held beliefs: that separate encoding/decoding is necessary in NMT and that attention is a required component of all NMT models. If you look at the many NMT models published in the last two years, all of them contain at least one of these properties, and the vast majority contain both.

This paper is not claiming that we should just throw away all of that progress. We just want to show the research community that it is possible to build vastly different translation models that can still perform well. This specific model that we presented also has the advantage of being extremely simple. Usually, new NMT papers introduce a new mechanism that makes an existing model more complex. Here we want to step forward by showing an NMT model that is simpler than almost anything else that came before it.

Yes, it’s not state of the art. Yes, it is trained using external alignment data. Yes, it requires special preprocessing. But it shatters two widely held beliefs. It also uses a constant amount of memory. And it works well on long sequences, which are known to be difficult for attention models. We firmly believe that these advantages far outweigh the disadvantages of our model.

And *that* is why we posted this paper. We think the community should start thinking more about models that don’t use attention. Or models that have combined encoding/decoding. Or maybe just take the eagerness property from our model and apply it to an attention model. These research directions could lead to an improvement in performance of state of the art models on long sequences. Or they could be used to lower the memory requirements of simultaneous translation systems. Interesting methods aren’t found only in state of the art models.


Summary by Ofir Press 5 years ago

An attention mechanism and a separate encoder/decoder are two properties of almost every single neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau et al. on the four language directions that are studied in the paper.

The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives at every timestep the current input word and has to predict the next word in the dataset. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word from the target sentence.

Obviously, in many cases such a simple model wouldn't work. For example, if your sentence was "The white dog" and you wanted to translate to Spanish ("El perro blanco"), at the 2nd timestep, the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet?

To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that the model would like to read more of the input sentence before emitting the next output word.

So in the example from above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol. At timestep 3, when the input is "dog", the model can emit the token "perro". These padding symbols are deleted in post-processing, before the output is returned to the user. You can see a visualization of the decoding process below:

https://i.imgur.com/znI6xoN.png
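
The padding step can be sketched in a few lines of Python. This is a simplified illustration and not the code from the repository: it assumes each target word comes with the index of the source word it is aligned to (for example from an external word aligner), and the names `make_eager_target`, `strip_pad` and `PAD` are hypothetical.

```python
from typing import List

PAD = "PAD"  # hypothetical padding symbol, matching the example above


def make_eager_target(target: List[str], aligned_source_index: List[int]) -> List[str]:
    """Insert PAD tokens so no target word comes before its aligned source word.

    aligned_source_index[j] is the (0-based) position of the source word
    that target word j is aligned to; this input format is an assumption.
    """
    padded: List[str] = []
    for word, src_idx in zip(target, aligned_source_index):
        # Wait (emit PAD) until the aligned source word has been read.
        while len(padded) < src_idx:
            padded.append(PAD)
        padded.append(word)
    return padded


def strip_pad(tokens: List[str]) -> List[str]:
    """Post-processing: drop the padding symbols before returning the output."""
    return [t for t in tokens if t != PAD]


# "The white dog" -> "El perro blanco": El~The (0), perro~dog (2), blanco~white (1)
padded = make_eager_target(["El", "perro", "blanco"], [0, 2, 1])
print(padded)             # ['El', 'PAD', 'perro', 'blanco']
print(strip_pad(padded))  # ['El', 'perro', 'blanco']
```

The padded target is now one token longer than the source, so in practice the source also has to be extended at the end to give the model a timestep for "blanco".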

To enable us to use beam search, our model actually receives the previously emitted target token in addition to the current source token at every timestep.
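
A minimal sketch of that decoding loop, using greedy decoding rather than beam search to keep it short. The `step` callable stands in for the recurrent network, and `scripted_step` is a toy replacement that just replays a fixed output; both, along with the token names and the choice to feed an end-of-sentence token once the source runs out, are assumptions made for this example and not the API of the released code.

```python
from typing import Callable, List, Optional, Tuple

PAD, BOS, EOS = "PAD", "<bos>", "<eos>"  # hypothetical special tokens

# step(source_token, previous_target_token, hidden) -> (next_target_token, new_hidden)
StepFn = Callable[[str, str, Optional[int]], Tuple[str, Optional[int]]]


def eager_translate(source: List[str], step: StepFn, max_extra: int = 10) -> List[str]:
    """Emit one target token per source token read, then strip the padding."""
    hidden: Optional[int] = None
    previous = BOS
    emitted: List[str] = []
    for t in range(len(source) + max_extra):
        # Assumption: after the source is exhausted, keep feeding EOS.
        source_token = source[t] if t < len(source) else EOS
        token, hidden = step(source_token, previous, hidden)
        if token == EOS:
            break
        emitted.append(token)
        previous = token
    return [tok for tok in emitted if tok != PAD]


def scripted_step(source_token: str, previous: str, hidden: Optional[int]):
    """Toy stand-in for the RNN: replays a fixed padded target, ignoring its inputs."""
    script = ["El", PAD, "perro", "blanco", EOS]
    position = 0 if hidden is None else hidden
    return script[position], position + 1


print(eager_translate(["The", "white", "dog"], scripted_step))
# ['El', 'perro', 'blanco']
```

Only `hidden` and `previous` are carried between timesteps, so memory use does not grow with the length of the source sentence.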

PyTorch code for the model is available at https://github.com/ofirpress/YouMayNotNeedAttention
