Welcome to ShortScience.org! 
[link]
This paper explores the use of socalled Monte Carlo objectives for training directed generative models with latent variables. Monte Carlo objectives take the form of the logarithm of a Monte Carlo estimate (i.e. an average over samples) of the marginal probability $P(x)$. One important motivation for using Monte Carlo objectives is that they can be shown (see the Importance Weighted Variational Autoencoder paper \cite{journals/corr/BurdaGS15} and my notes on it) to correspond to bounds on the true likelihood of the model, and one can tighten the bound simply by drawing more samples in the Monte Carlo objective. Currently, the most successful application of Monte Carlo objectives is based on an importance sampling estimate, which involves training a proposal distribution $Q(hx)$ in addition to the model $P(x,h)$. This paper considers the problem of training with gradient descent on such objectives, in the context of a model to which the reparametrization trick cannot be used (e.g. for discrete latent variables). They analyze the sources of variance in the estimation of the gradients (see Equation 5) and propose a very simple approach to reducing the variance of a samplingbased estimator of these gradients. First, they argue that gradients with respect to the $P(x,h)$ parameters are less susceptible to problems due to high variance gradients. Second, and most importantly, they derive a multisample estimate of the gradient that is meant to reduce the variance of gradients on the proposal distribution parameters $Q(hx)$. The end result is the gradient estimate of Equations 1011. It is based on the observation that the first term of the gradient of Equation 5 doesn't distinguish between the contribution of each sampled latent hi. The key contribution is this: they notice that one can incorporate a variance reducing baseline for each sample hi, corresponding to the Monte Carlo estimate of the loglikelihood when removing hi from the estimate (see Equation 10). The authors show that this is a proper baseline, in that using it doesn't introduce a bias in the estimation for the gradients. Experiments show that this approach yields better performance than training based on Reweighted Wake Sleep \cite{journals/corr/BornscheinB14} or the use of NVIL baselines \cite{conf/icml/MnihG14}, when training sigmoid belief networks as generative models or as structured output prediction (image completion) models on binarized MNIST. 
[link]
This paper proposes a neural architecture that allows to backpropagate gradients though a procedure that can go through a variable and adaptive number of iterations. These "iterations" for instance could be the number of times computations are passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper. This is essentially achieved by pooling the recurrent states and respective outputs computed by each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units, which are sigmoidal units computed at each iteration and which gives a soft weight on whether the computation should stop at the current iteration. Crucially, the paper introduces a new ponder cost $P(x)$, which is a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below). The paper presents experiment on RNNs applied on sequences where, at each time step t (not to be confused with what I'm calling computation iterations, which are indexed by n) in the sequence the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled, to produce a single recurrent state and output for the time step t. During each of the $N(t)$ iterations at time step t, the intermediate states are connected to the same timestept input. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves to the next time step $t+1$. To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration. Results are presented on a variety of synthetic problems and a character prediction problem. 
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or, even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a metanetwork that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler, and more intuitive, so I hope I’ll be able to convey that in this explanation. In order to think about this problem properly, it makes sense to take a few steps back, and think about some fundamental assumptions that underly machine learning. https://i.imgur.com/Q45w0QT.png One very basic one is that you need some notion of similarity between observations in your training set, and potential new observations in your test set, in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another? If you’re using a method like K Nearest Neighbors, you calculate a point’s class identity based on the closest trainingset observations to it in Euclidean space, and you assume that nearness in that space corresponds to likelihood of two data points having come the same class. This is useful for the use case of having new classes show up after training, since, well, there isn’t really a training period: the strategy for KNN is just carrying your whole training set around, and, whenever a new test point comes along, calculating it’s closest neighbors among those trainingset points. If you see a new class in the wild, all you need to do is add the examples of that class to your group of training set points, and then after a few examples, if your assumptions hold, you’ll be able to predict that class by (hopefully) finding those two or three points as neighbors. But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn’t expect to perform well here, because you will have distances in these 19 meaningless directions spreading out your points, due to randomness, more than the meaningful dimension spread them out due to belonging to different classes. And what if you want to be able to learn nonlinear relationships between your features, which the composability of multilayer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you’ll just have to strike out for the territories and create a metric space for yourself. That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class. You may notice me using “vector space” and “embedding” similarity; they are the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some pdimensional space, where p is a chosen hyperparameter. What are the concrete learning steps this architecture goes through? 1. During each training episode, sample a subset of classes, and then divide those classes into training examples, and query examples 2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space. 3. Once all training examples are mapped into the space, calculate a “mean vector” for class A by averaging all of the embeddings of training examples that belong to class A. This is the “prototype” for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around to evaluate is only (numdimensions) * (numclasses), rather than (numdimensions) * (numtrainingexamples). 4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network’s predicted probability, it’s a set of floats that add up to 1). 5. Then, you can calculate the (crossentropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network 6. Add up the prediction loss for all the query examples, and then backpropogate through the network to update your weights The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (i.e. car vs chair, as opposed to car vs themeaningoflife), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training. I admit to not being sufficiently familiar with the datasets used for testing to have a sense for how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state of the art results compared to other approaches in this domain of fewshot learning (matching networks, and the aforementioned metalearning). One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because Euclidean distance, but not cosine distance belongs to a category of divergence/distance metrics (called Bregman Divergences) that have a special set of properties such that the point closest on aggregate to all points in a cluster is the average of all those points. If you want to dive way deep into the minutia on this point, I found this blog post quite good: http://mark.reid.name/blog/meetthebregmandivergences.html
1 Comments

[link]
Ulyanov et al. utilize untrained neural networks as regularizer/prior for various image restoration tasks such as denoising, inpainting and superresolution. In particualr, the standard formulation of such tasks, i.e. $x^\ast = \arg\min_x E(x, x_0) + R(x)$ where $x_0$ is the input image and $E$ a taskdependent data term, is rephrased as follows: $\theta^\ast = \arg\min_\theta E(f_\theta(z); x_0)$ and $x^\ast = f_{\theta^\ast}(z)$ for a fixed but random $z$. Here, the regularizer $R$ is essentially replaced by an untrained neural network $f_\theta$ – usually in the form of a convolutional encoder. The authors argue that the regualizer is effectively $R(x) = 0$ if the image can be generated by the encoder from the fixed code $z$ and $R(x) = \infty$ if not. However, this argument does not necessarily provide any insights on why this approach works (as demonstrated in the paper). A main question addressed in the paper is why the network $f_\theta$ can be used as a prior – regarding the assumption that highcapacity networks can essentially fit any image (including random noise). In my opinion, the authors do not give a convincing answer to this question. Essentially, they argue that random noise is just harder to fit (i.e. it takes longer). Therefore, limiting the number of iterations is enough as regularization. Personally I would argue that this observation is mainly due to prior knowledge put into the encoder architecture and the idea that natural images (or any images with some structure) are easily embedded into lowdimensional latent spaced compared to fully I.i.d. random noise. They provide experiments on a range of tasks including denoising, image inpainting, superresolution and neural network “inversion”. Figure 1 shows some results for image inpainting that I found quite convincing. For the remaining experiments I refer to the paper. https://i.imgur.com/BVQsaup.png Figure 1: Qualitative results for image inpainting. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
The general goal of metalearning systems is to learn useful shared structure across a broad distribution of tasks, in such a way that learning on a new task can be faster. Some of the historical ways this has been done have been through initializations (i.e. initializing the network at a point such that it is easy to further optimize on each individual task, drawn from some distribution of tasks), and recurrent network structures (where you treat the multiple timesteps of a recurrent network as the training iterations on a single task, and train the recurrent weights of the network based on generalization performance on a wide range of tasks). This paper proposes a different approach: a learned proxy loss function. The idea here is that, often, early in the learning process, handcoded rewards aren’t the best or most valuable signal to use to guide a network, both because they may be high variance, and because they might not natively incentivize things like exploration rather than just exploitation. A better situation would be if we had some more farsighted loss function we could use, that had proved to be a good proxy over a variety of different rewards. This is exactly what this method proposes to give us. Training consists of an inner loop, and an outer loop. Each instantiation of the inner loop corresponds to a single RL task, drawn from a distribution over tasks (for example, all tasks involving the robot walking to a position, with a single instantiated task being the task of walking to one specific position). Within the inner loop, we apply a typical policy gradient loop of optimizing the parameters of our policy, except, instead of expected rewards, we optimize our policy parameters according to a loss function we specifically parametrize. Within the outer loop, we take as signal the final reward on the trained policy on this task, and use that to update our parametrized loss. This parametrized loss is itself a neural network, that takes in the agent’s most recent set of states, actions, and rewards at a rolling window of recent timesteps, and performs temporal convolutions on those, to get a final loss value out the other side. In short, this auxiliary network takes in information about the agent’s recent behavior, and outputs an assessment of how well the agent is doing according to this longerview loss criteria. Because it’s not possible to directly formulate the test performance of a policy in terms of the loss function that was used to train the policy (which would be necessary for backprop), the weights of this losscalculating network are instead learned via evolutionary strategies. At a zoomedout level of complexity, this means: making small random perturbations to the current parameters of the network, and moving in the direction of the random change that works the best. So, ultimately, you end up with a loss network that takes in recent environmental states and the behavior of the agent, and returns an estimate of the proxy loss value, that has hopefully been trained such that it captures environmental factors that indicate progress on the task, over a wide variety of similar tasks. Then, during testing, the RL agent can use that loss function to adapt its behavior. An interesting note here is that for tasks where the parameters of the task being learned are inferable from the environment  for example, where the goal is “move towards the green dot”, you don’t actually need to give the agent the rewards from a new task; ideally, it will have learned how to infer the task from the environment. One of the examples they use to prove their method has done something useful is train their model entirely on tasks where an antagent’s goal is to move towards various different targets on the right, and then shift it to a scenario where its target is towards the left. In the EPG case, the ant was able to quickly learn to move left, because it’s loss function was able to adapt to the new environment where the target had moved. By contrast, RL^2 (a trained learning algorithm implemented as a recurrent network) kept on moving right as its initial strategy, and seemed unable to learn the specifics of a task outside its original task distribution of “always move right”. I think this paper could benefit from being a little bit more concrete about what it’s expected use cases are (like: what kinds of environments lend themselves to having proxy loss functions inferred from environmental data? Which don’t?), but overall, I find the kernel of idea this model introduces interesting, and will be interested to see if other researchers run with it. 