[link]
I should say from the outset: I have a lot of fondness for this paper. It goes upstream of a lot of research-community incentives: it's not methodologically flashy, and it's not about beating the state of the art with a bigger, better model (though those papers certainly also have their place). The goal of this paper is, instead, to dive into a test set used to evaluate model performance, and to understand to what extent it really provides a rigorous test of the model behavior we want. Test sets are the often-invisible foundation upon which ML research is built, but, like real-world foundations, if there are weaknesses, the research edifice built on top can suffer.

Specifically, this paper discusses the Winograd Schema, a clever test set used to probe what the NLP community calls "common sense reasoning". An example Winograd Schema sentence is: "The delivery truck zoomed by the school bus because it was going so fast." A model is given this sentence and asked to predict which token the pronoun "it" refers to. These cases are specifically chosen for their syntactic ambiguity: nothing structural about the order of the sentence requires "it" to refer to the delivery truck here. However, the underlying meaning of the sentence is only coherent under that parsing. This is what is meant by "common-sense" reasoning: the ability to understand the meaning of a sentence in a way deeper than that allowed by simple syntactic parsing and word co-occurrence statistics.

Taking the existing Winograd examples (and the set is tiny: there are literally 273 of them), the authors of this paper surface some concerns about ways these examples might not be as difficult, or as representative of "common sense" abilities, as we might like.

- First off, there is the basic fact that there are so few examples that it's possible to perform well simply by random chance, especially over combinatorially large hyperparameter optimization spaces. This isn't so much an indictment of the set itself as it is indicative of the work involved in creating it.
- One of the two distinct problems the paper raises is "associativity". This refers to situations where simple co-occurrence counts between the description and the correct entity can lead the model to the correct term, without its actually having to parse the sentence. An example here is: "I'm sure that my map will show this building; it is very famous." Treasure maps aside, "famous buildings" are much more common than "famous maps", so associating "it" with the building here doesn't actually require the model to understand what's going on in this specific sentence. The authors test this by defining a co-occurrence threshold and, using that threshold, flag about 40% of the examples as "associative" (a rough sketch of such a check follows below).
- The second problem is predictable structure: the "hinge" adjective is so often the last word in the sentence that the model may be brittle, attending just to that rather than to the sentence as a whole.

The authors perform a few tests: examining results on associative vs. non-associative examples, and examining results when the ordering is switched (in cases like "Emma did not pass the ball to Janie although she saw that she was open," where it's syntactically possible), to ensure the model is not just anchoring on the identity of the correct entity, regardless of its place in the sentence.
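To make the associativity filter concrete, here is a rough sketch of how such a check could look. This is my own illustration under assumed inputs (a pre-computed co-occurrence table and a hand-picked margin), not the paper's actual statistic or threshold.

```python
# Hypothetical sketch of an "associativity" flag: an example counts as
# associative if the correct candidate co-occurs with the key context words
# far more often than the wrong candidate does in some large corpus.
def is_associative(correct, incorrect, context_words, cooccur, margin=2.0):
    """`cooccur[(a, b)]` is a pre-computed corpus co-occurrence count
    (assumed to exist); `margin` is an illustrative threshold, not the
    paper's tuned value."""
    def score(candidate):
        return sum(cooccur.get((candidate, w), 0) for w in context_words)
    return score(correct) >= margin * max(score(incorrect), 1)

# Example: "I'm sure that my map will show this building; it is very famous."
cooccur = {("building", "famous"): 5000, ("map", "famous"): 300}
print(is_associative("building", "map", ["famous"], cooccur))  # True
```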
Overall, they found evidence that some state-of-the-art language models perform well on the Winograd Schema as a whole, but do less well (and in some cases even less well than the baselines they otherwise outperform) on these more rigorous subsets. Unfortunately, these tests don't lead us automatically to a better solution - designing examples like these is still tricky and hard to scale - but they do provide valuable caution and food for thought.
[link]
For solving sequence modeling problems, recurrent architectures have historically been the most common solution, but recently temporal convolution networks, especially with dilations to help capture longer-term dependencies, have gained prominence. RNNs have, in theory, much greater capacity to learn long sequences, but they also have a lot of difficulty propagating signal forward through long chains of recurrent operations. This paper, which proposes Trellis Networks, places itself squarely in the middle of the debate between these two paradigms. TrellisNets are designed to be a theoretical bridge between temporal convolutions and RNNs - more specialized than the former, but more generalized than the latter.

https://i.imgur.com/J2xHYPx.png

The architecture of TrellisNets is very particular and, unfortunately, somewhat hard to internalize without squinting at diagrams and equations for a while. Fundamentally:

- At each layer in a TrellisNet, the network creates a "candidate pre-activation" by combining information from the input and the layer below, for both the current and the previous timestep.
- This candidate pre-activation is then non-linearly combined with the prior-layer, prior-timestep hidden state.
- This process continues for some desired number of layers.

https://i.imgur.com/f96QgT8.png

At first glance, this structure seems pretty arbitrary: a lot of quantities connected together, but without a clear mechanic for what's happening. However, a few things are worth noting here, and they help connect the dots so that TrellisNet can be viewed as either a kind of RNN or a kind of CNN (a minimal code sketch of the layer update follows below):

- TrellisNet uses the same weight matrices to process prior- and current-timestep inputs/hidden states, no matter which timestep or layer it's on. This is strongly reminiscent of a recurrent architecture, which applies the same calculation at each timestep.
- TrellisNets also re-insert the model's input at each layer. This also gives it more of an RNN-like structure, where the prior layer's values act as a kind of "hidden state" that is then combined with an input value.
- At a given layer, each timestep only needs access to two elements of the prior layer (in addition to the input); it does not require access to the prior-timestep values of its own layer. This is important because it means you can calculate an entire layer's values at once, given the values of the prior layer, so these models can be more easily parallelized for training.

Seeing TrellisNets as a kind of temporal CNN is fairly straightforward: each timestep's value, at a given layer, is based on a "filter" over the lower layer's values at the current and prior timesteps, and this filter is shared across the whole sequence. Framing them as an RNN is certainly trickier, and anyone wanting to understand it in full depth is probably best served by returning to the paper's equations. At a high level, the authors show that TrellisNets can represent a specific kind of RNN: a truncated RNN, where each timestep only uses history from the prior M timesteps, rather than the full sequence. This works by, roughly, imagining the RNN chains as lying along the diagonals of a TrellisNet architecture diagram: as you reach higher layers, you can also reach farther back in time. Specifically, a TrellisNet that wants to represent a depth-K truncated RNN, allowed to unroll through M steps of history, can do so using M + K - 1 layers.
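To make the shared-weight layer update more concrete, here is a minimal PyTorch-style sketch of one TrellisNet-like layer. It is a simplification under my own assumptions (a plain residual tanh instead of the paper's LSTM-style gating), intended only to show the structure: a width-2 causal convolution over the concatenated input and lower layer, with the same weights reused at every layer and timestep.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrellisLayerSketch(nn.Module):
    """Simplified TrellisNet-style layer (not the paper's exact equations)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One shared width-2 kernel over time: mixes (t-1, t) of [x; h_below].
        self.mix = nn.Conv1d(input_dim + hidden_dim, hidden_dim, kernel_size=2)

    def forward(self, x, h_below):
        # x: (batch, input_dim, time); h_below: (batch, hidden_dim, time)
        z = torch.cat([x, h_below], dim=1)   # re-insert the input at every layer
        z = F.pad(z, (1, 0))                 # causal padding on the time axis
        pre = self.mix(z)                    # "candidate pre-activation"
        # The paper combines this with the prior-layer state via LSTM-style
        # gating; here a simple residual tanh stands in for that step.
        return torch.tanh(pre) + h_below

# Because each timestep only needs (t-1, t) of the layer below, a whole layer
# can be computed in parallel; stacking layers reaches further back in time.
layer = TrellisLayerSketch(input_dim=16, hidden_dim=32)
h = layer(torch.randn(8, 16, 100), torch.randn(8, 32, 100))  # (8, 32, 100)
```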
Essentially, by using a fixed operation across layers and timesteps, the TrellisNet authors blur the line between layer and timestep: any chain of operations across layers is fundamentally a series of the same operation, performed many times, and is in that way RNN-like. The authors have not yet taken a stab at translation, but they tested their model on a number of word- and character-level language modeling tasks (predicting the next word or character, given prior ones) and were able to beat SOTA on many of them. I'd be curious to see more work in this domain, and to gain a better understanding of where a fixed, recurrently-applied layer operation, like the ones used in RNNs and in this paper, is valuable, and where (as in a "normal" CNN) having specific weights for different levels of the hierarchy is valuable.
[link]
This paper is, on the whole, a refreshing jaunt into the applied side of the research world. It isn't looking to solve a fundamental machine learning problem in some new way, but it does highlight and explore one potentially beneficial application of a common and widely used technique: specifically, combining word embeddings with context-free grammars (such as regular expressions) to make the latter less rigid.

Regular expressions work by specifying hardcoded patterns of symbols and matching against any strings in some search set that fit those patterns. They don't need to specify individual characters - they can work at higher levels of generality, like "uppercase alphabetic character" or "any character" - but they're still fundamentally hardcoded, in that the designer of the expression needs to create a specification that will affirmatively catch all the desired cases. This can be a particularly challenging task when you're trying to find, for example, all sentences that match the pattern of someone giving someone else a compliment. You might want to match against "I think you're smart" and also "I think you're clever". However, in the normal use of regular expressions, something like this would be nearly impossible to specify, short of writing out every synonym for "intelligent" that you can think of.

The "Embedding Grammars" paper proposes a solution to this problem: instead of enumerating a list of synonyms, simply provide one example term, or, even better, a few examples, and use those terms' word embedding representations to define a "synonym bubble" (my word, not theirs) in continuous space around those examples. This is based on the oft-remarked-upon fact that, because word embedding systems are generally trained to push together words that can be used in similar contexts, closeness in word vector space frequently corresponds to words being synonyms, or close in some other sense. So, if you "match" any term that is sufficiently near your exemplar terms, you are doing something similar to enumerating all of a term's synonyms. Once this general intuition is in hand, the details of the approach are fairly straightforward: the authors try a few approaches, and find that constructing a bubble of some epsilon around each example's word vector, and matching anything inside that bubble, works best (a rough sketch of this matching rule follows below).

https://i.imgur.com/j9OSNuE.png

Overall, this seems like a clever idea; one imagines that word embeddings will keep branching out into ever more far-flung applications as time goes on. There are reasons to be skeptical of this paper, though. Fundamentally, word embedding space is a "here there be dragons" kind of place: we may be able to observe broad patterns, and we might be able to say that "nearby words tend to be synonyms," but we can't give any kind of guarantee of that being the case. As an example of this problem, the nearest things to a word, after its direct synonyms, are often its direct antonyms, so if you make the bubble too large, you'll potentially match words exactly opposite to what you expect. We are probably still a ways away from systems like this one being broadly useful, for this and other reasons, but I do think it's valuable to try to understand what questions we'd want answered, and what features of embedding space we'd want more elucidated, before applications like these would be more stably usable.
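As a concrete illustration of the "bubble" matching rule, here is a small sketch. The embedding table, the cosine-similarity formulation, and the 0.7 threshold are all my own assumptions for illustration; the paper's actual implementation and epsilon may differ.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def in_synonym_bubble(token, exemplars, embeddings, threshold=0.7):
    """True if `token` falls inside the similarity 'bubble' of any exemplar.
    `embeddings` is any word -> vector mapping (e.g. pre-trained GloVe or
    word2vec vectors); the 0.7 threshold is an illustrative guess."""
    if token not in embeddings:
        return False
    vec = embeddings[token]
    return any(cosine(vec, embeddings[e]) >= threshold
               for e in exemplars if e in embeddings)

# Hypothetical grammar slot: instead of /(smart|clever|bright|...)/ the rule
# carries exemplars, and candidate words are tested against the bubble.
# compliment_slot = {"exemplars": ["smart", "clever"]}
```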
[link]
An attention mechanism and a separate encoder/decoder are two properties of almost every neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau et al. on the four language directions studied in the paper.

The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives at every timestep the current input word and has to predict the next word. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word of the target sentence. Obviously, in many cases such a simple model wouldn't work. For example, if your sentence was "The white dog" and you wanted to translate it to Spanish ("El perro blanco"), then at the 2nd timestep the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet?

To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that it would like to read more of the input sentence before emitting the next output word. So in the example above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol, and at timestep 3, when the input is "dog", the model can emit the token "perro". These padding symbols are deleted in post-processing, before the output is returned to the user. You can see a visualization of the decoding process below:

https://i.imgur.com/znI6xoN.png

To enable beam search, the model actually receives the previously outputted target token in addition to the current source token at every timestep. PyTorch code for the model is available at https://github.com/ofirpress/YouMayNotNeedAttention
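Here is a small sketch of the padding idea described above, using a made-up alignment. This is my own illustration, not the preprocessing code from the linked repository, and it assumes the target-to-source alignment comes from an external aligner.

```python
PAD = "<PAD>"

def pad_target(target, alignment):
    """Insert PAD tokens so target token j is only emitted once its aligned
    source token has been read. At output position t (1-indexed) the model
    has read the first t source tokens, so we delay with PADs until
    alignment[j] < current output position."""
    out = []
    for tok, src_idx in zip(target, alignment):
        while src_idx > len(out):   # needed source token not yet read
            out.append(PAD)
        out.append(tok)
    return out

# "The white dog" -> "El perro blanco"; "perro" is aligned to "dog" (index 2).
print(pad_target(["El", "perro", "blanco"], alignment=[0, 2, 1]))
# -> ['El', '<PAD>', 'perro', 'blanco']
```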
[link]
I admit it - the title of the paper pulled me in, existing as it does in the chain of weirdly insider-meme papers starting with Vaswani et al.'s 2017 "Attention Is All You Need". That paper has been hugely influential, and the domain of machine translation as a whole has begun to move away from processing (or encoding) source sentences with recurrent architectures, and toward processing them with self-attention architectures. (Self-attention is a little too nuanced to cover in full depth here, but the basic idea is: instead of summarizing a varying-length sequence by feeding each timestep into a recurrent loop and building up hidden states, generate a query, and weight the contribution of each timestep to each "hidden state" by the dot product between that query and each timestep's representation.) There has been an overall move in recent years away from recurrence as the accepted default for sequence data, and towards attention and (often dilated) convolution taking up more space. I find this an interesting set of developments, and I had hopes that this paper would address that arc.

Unfortunately, the title is quite out of sync with the actual focus of the paper. Instead of addressing the contribution of attention mechanisms vs. recurrence, or even directly addressing any of the particular ideas posed in "Attention Is All You Need", this paper (YMNNA) takes aim at a more fundamental structural feature of translation models: the encoder/decoder structure. The basic idea of an encoder/decoder approach to translation is that you process the entire source sentence before you start generating the tokens of the predicted, other-language target sentence. Initially, this worked by running an RNN over the full source sentence and using the final hidden state of that RNN as a compressed representation of the whole sentence. More recently, the norm has been to use multiple layers of RNNs, to represent the source sentence via the hidden states at each timestep (so: as many hidden states as there are input tokens), and then, at each step of decoding, to calculate an attention-weighted average over all of those hidden states. But, fundamentally, both of these structures share the fact that some kind of global representation is calculated and made available to the decoder before it starts predicting words in the output sentence.

This makes sense for a few reasons. First, and most obviously, languages aren't naturally aligned with one another, in the sense of one word in language X corresponding to one word in language Y; you can't predict a word in the target sentence if its corresponding source token hasn't yet been processed. Second, contextual information from the sentence as a whole can disambiguate between different senses of a word, which may have different translations - think Teddy Bear vs. Teddy Roosevelt. However, this paper poses the question: how well can you do if you throw away this structure and build a model that continually emits tokens of the target sequence as it reads in the source sentence? Using a recurrent model, the YMNNA model takes, at each timestep, the new source token, the previous target token, and the prior hidden state from the last timestep of the RNN, and uses these to predict a token (a minimal sketch of this per-timestep step follows below).
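To make the eager setup concrete, here is a minimal PyTorch-style sketch of the per-timestep step described above. The cell type, dimensions, and class name are my own assumptions; the released YouMayNotNeedAttention code is the authoritative reference.

```python
import torch
import torch.nn as nn

class EagerStepSketch(nn.Module):
    """One step of an 'eager' translator: consume a source token and the
    previously emitted target token, update a recurrent state, and score
    the next target token (which may be a buffer/PAD symbol)."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.cell = nn.LSTMCell(2 * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_tok, prev_tgt_tok, state=None):
        # src_tok, prev_tgt_tok: (batch,) token ids; state: (h, c) or None
        inp = torch.cat([self.src_emb(src_tok), self.tgt_emb(prev_tgt_tok)], dim=-1)
        h, c = self.cell(inp, state)
        return self.out(h), (h, c)   # logits over the target vocabulary, new state
```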
However, that problem mentioned earlier - languages not being natively aligned such that you have the necessary information to predict a word by the time you reach its point in the target sequence - hasn't gone away, and is still alive and kicking. This paper solves it in a pretty unsatisfying way: by relying on an external tool, fast-align, that guesses which source tokens correspond to which target tokens and inserts buffer tokens into the target, so that you don't need to predict a word until it's already been seen by the source-reading RNN; until then you just predict the buffer. This is fine and clever as a practical heuristic, but it really does make their comparisons against models that do alignment and translation jointly feel a little weak.

https://i.imgur.com/Gitpxi7.png

An additional heuristic that makes the overall narrative of the paper less compelling is the fact that, in order to get comparable performance to their baselines, they padded the target sequences with between 3 and 5 buffer tokens, meaning that the models learned they could process the first 3-5 tokens of the sentence before needing to start emitting the target. Again, there's nothing necessarily wrong with this, but, since the model consumes a portion of the sentence before it starts emitting translations, it makes for a less stark comparison with the "read the whole sentence" encoder/decoder framework.

A few other frustrations, and notes from the paper's results section:

- As mentioned earlier, the authors don't actually compare their work against "Attention Is All You Need", but instead against a 2014 paper. This is confusing both in terms of using an old baseline as SOTA, and in terms of their title implicitly arguing that they refute a paper they didn't compare to.
- Against that older baseline, their eager translation model performs worse on all sentences shorter than 60 tokens (which make up the vast majority of all sentences), and only beats the baseline on sentences longer than 60 tokens.
- Additionally, they note as a sort of throwaway line that their model took almost three times as long to train as the baseline, with the same number of parameters, simply because it took so much longer to converge.

Being charitable, there is some argument that an eager translation framework performs well on long sentences, and can do so while only keeping a single hidden state in memory, rather than keeping the hidden state for each source-sequence element around, as attention-based decoders require. Overall, though, I found this paper a frustrating let-down that relies on too many heuristics and hacks to be a compelling comparison to prior work.