Welcome to ShortScience.org! 
[link]
Meta learning, or, the idea of training models on some distribution of tasks, with the hope that they can then learn more quickly on new tasks because they have “learned how to learn” similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there’s an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc), and have the outer loop that updates some structural part of your model be based on improving generalization error on each task within the distribution. It’s been demonstrated that metalearned systems can in fact learn more quickly (at least when their tasks are “in distribution” relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested with how much better or faster they’re learning, and more interested in whether there are qualitative differences in the way normal learning systems and metatrained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there’s only the one on this paper) goes about this by studying simple learning tasks where it’s easier for us to introspect into what each model is learning over time. https://i.imgur.com/ceycq46.png In the first test, he looks at linear regression in a simple setting: where for each individual “task” data is generated according a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where higher eigenvalues on the factors, or “modes”, represent that those factors represent largerscale patterns that explain more variance, and lower eigenvalues are smaller scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large scale (higher eigenvalue) factors of variation, and then over the course of training improve performance on the higherprecision factors. By contrast, meta learners, in addition to learning faster, also learn large scale and small scale modes at the same rate. Similar analysis was performed and similar results found for nonlinear regression, where instead of PCAstyle components, the function generating data were decomposed into different Fourier frequencies, and the normal learner learned the broad, lowfrequency patterns first, where the meta learner learned them all at the same rate. The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayesoptimal learner would update on new data points, in the world where that learner had a prior over the datagenerating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior correspondant with the task distribution it was trained on. This is, at a high level, what I think we all sort of thought was happening with meta learning, but it’s pretty neat to see it laid out in a small enough problem where we can actually validate against an analytic model. A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose to use the specific eigenvaluefocused metrics of progression on task learning that he did. They seem reasonable, but I’d have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn’t really that analogous to the other tests) 
[link]
Reinforcement Learning is often broadly separated into two categories of approaches: modelfree and modelbased. In the former category, networks simply take observations and input and produce predicted bestactions (or predicted values of available actions) as output. In order to perform well, the model obviously needs to gain an understanding of how its actions influence the world, but it doesn't explicitly make predictions about what the state of the world will be after an action is taken. In modelbased approaches, the agent explicitly builds a dynamics model, or a model in which it takes in (past state, action) and predicts next state. In theory, learning such a model can lead to both interpretability (because you can "see" what the model thinks the world is like) and robustness to different reward functions (because you're learning about the world in a way not explicitly tied up with the reward). This paper proposes an interesting melding of these two paradigms, where an agent learns a model of the world as part of an endtoend policy learning. This works through something the authors call "observational dropout": the internal model predicts the next state of the world given the prior one and the action, and then with some probability, the state of the world that both the policy and the next iteration of the dynamics model sees is replaced with the model's prediction. This incentivizes the network to learn an effective dynamics model, because the farther the predictions of the model are from the true state of the world, the worse the performance of the learned policy will be on the iterations where the only observation it can see is the predicted one. So, this architecture is modelfree in the sense that the gradient used to train the system is based on applying policy gradients to the reward, but modelbased in the sense that it does have an internal world representation. https://i.imgur.com/H0TNfTh.png The authors find that, at a simple task, Swing Up Cartpole, very low probabilities of seeing the true world (and thus very high probabilities of the policy only seeing the dynamics model output) lead to world models good enough that a policy trained only on trajectories sampled from that model can perform relatively well. This suggests that at higher probabilities of the true world, there was less value in the dynamics model being accurate, and consequently less training signal for it. (Of course, policies that often could only see the predicted world performed worse during their original training iteration compared to policies that could see the real world more frequently). On a more complex task of CarRacing, the authors looked at how well a policy trained using the representations of the world model as input could perform, to examine whether it was learning useful things about the world. https://i.imgur.com/v9etll0.png They found an interesting tradeoff, where at high probabilities (like before) the dynamics model had little incentive to be good, but at low probabilities it didn't have enough contact with the real dynamics of the world to learn a sensible policy. 
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticketesque properties with respect to their role as initializations for downstream transfer tasks. As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove lowmagnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training"), down to high levels of sparsity (540% of original weights, and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massivelypretrained BERT language model. They try to find out: 1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task 2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another. If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks. https://i.imgur.com/pAF08H3.png One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training. A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream maskfinding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though is potentially useful is they do transfer to some other tasks, as has been shown within a set of imageprediction tasks). 
[link]
#### This nice paper looks amazing at the first sight since it brings a mixture of:  Fancy models  Stateofart training procedure(considering the 32GPU distributed training effort which takes 21 days to get the best result)  Significant theory metric improvement(single model: 51.3 > 30 perplexity reduction, ensemble model:41.0 > 23.7)  Benchmark on a somewhat industry scale(vocabulary of 793471 words, 0,8B words training data) dataset rather than a pure research one. #### However, I also want to add some criticism:  As [1] mentioned perplexity is somewhat confusing metric, big perplexity may not reflect the real improvement, it would rather bring some kind of "exaggerating" effect.  This paper only provide the language model improvement, however, LMs are usually embedded into a complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the LMs provided in this paper could share its result with integrating into some endtoend products. Since the authors are working for Google Brain team, this is not too much a stringent requirement.  So far as I know, the data set used by this paper is from news stories[2], this kind of data set is more formal than oral one. And for real application, what we face are usually less formal data(such as search engine and speech recognition). It is still a question what the best model mentioned in this paper will perform in a more realistic scenario. Again, for Google Brain team, this should not be a big obstacles for integrating it with existing system just by replacing or complementing the existing LMs. Although I posted some personal criticism, I do still appreciate this nice paper and recommend this as a "mustread" for NLP and related guys since I do think this paper provide a unifying and comprehensive surveystyle perspective for us to help grasp the latest stateofart language model technology in an efficient way. References:  [1].http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf  [2].http://static.googleusercontent.com/media/research.google.com/zhCN//pubs/archive/41880.pdf 
[link]
This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an endtoend trained neural network architecture. Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters $\bf{u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling. While previous work has focused on using rank pooling on handdesigned and fixed representations, this paper proposes to use ConvNet features (pretrained on ImageNet) for the representation and backpropagate through rank pooling to finetune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21). Finally, since its gradient requires inverting a matrix (corresponding to a hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rankone matrices (see Lemma 3) or to simply use an approximation based on using a diagonal hessian. In experiments on two small scale video activity recognition datasets (UCFSports and Hollywood2), the authors show that finetuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling. **My two cents** This paper was eye opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought provoking approach to pooling representations in a way that preserves some organizational information about the original representations. I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guaranties on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability. I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available. Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).
2 Comments
