Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

When to use parametric models in reinforcement learning?

van Hasselt, Hado and Hessel, Matteo and Aslanides, John

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

van Hasselt, Hado and Hessel, Matteo and Aslanides, John

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

[link]
This paper is a bit provocative (especially in the light of the recent DeepMind MuZero paper), and poses some interesting questions about the value of model-based planning. I'm not sure I agree with the overall argument it's making, but I think the experience of reading it made me hone my intuitions around why and when model-based planning should be useful. The overall argument of the paper is: rather than learning a dynamics model of the environment and then using that model to plan and learn a value/policy function from, we could instead just keep a large replay buffer of actual past transitions, and use that in lieu of model-sampled transitions to further update our reward estimators without having to acquire more actual experience. In this paper's framing, the central value of having a learned model is this ability to update our policy without needing more actual experience, and it argues that actual real transitions from the environment are more reliable and less likely to diverge than transitions from a learned parametric model. It basically sees a big buffer of transitions as an empirical environment model that it can sample from, in a roughly equivalent way to being able to sample transitions from a learnt model. An obvious counter-argument to this is the value of models in being able to simulate particular arbitrary trajectories (for example, potential actions you could take from your current point, as is needed for Monte Carlo Tree Search). Simply keeping around a big stock of historical transitions doesn't serve the use case of being able to get a probable next state *for a particular transition*, both because we might not have that state in our data, and because we don't have any way, just given a replay buffer, of knowing that an available state comes after an action if we haven't seen that exact combination before. (And, even if we had, we'd have to have some indexing/lookup mechanism atop the data). I didn't feel like the paper's response to this was all that convincing. It basically just argues that planning with model transitions can theoretically diverge (though acknowledges it empirically often doesn't), and that it's dangerous to update off of "fictional" modeled transitions that aren't grounded in real data. While it's obviously definitionally true that model transitions are in some sense fictional, that's just the basic trade-off of how modeling works: some ability to extrapolate, but a realization that there's a risk you extrapolate poorly. https://i.imgur.com/8jp22M3.png The paper's empirical contribution to its argument was to argue that in a low-data setting, model-free RL (in the form of the "everything but the kitchen sink" Rainbow RL algorithm) with experience replay can outperform a model-based SimPLe system on Atari. This strikes me as fairly weak support for the paper's overall claim, especially since historically Atari has been difficult to learn good models of when they're learnt in actual-observation pixel space. Nonetheless, I think this push against the utility of model-based learning is a useful thing to consider if you do think models are useful, because it will help clarify the reasons why you think that's the case. |

DRAW: A Recurrent Neural Network For Image Generation

Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan

International Conference on Machine Learning - 2015 via Local Bibsonomy

Keywords: dblp

Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan

International Conference on Machine Learning - 2015 via Local Bibsonomy

Keywords: dblp

[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data. #### What is DRAW: The deep recurrent attention writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder. #### What do we gain? The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem. #### What follows? A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder. Although this might be less useful since we are already restricting the input of the network. #### Like: * As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way. * The attention model is fully differentiable. #### Dislike: * I think a better exposition of the attention mechanism would improve this paper. |

Bayesian dark knowledge

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, allowing to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to Monte Carlo estimate of the posterior predictive distribution of the model. The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) in replicating the behavior and performance of a much larger model (teacher), by essentially training the student to match the outputs of the teacher. The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict to output of ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD. Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model, whose outputs should be calibrated to the bayesian predictive distribution. |

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

Vincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Vincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the so-called spherical softmax. One use case could be for training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and d the hidden layer dimensionality, they show that their trick allows to scale in $O(d^2)$. Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU. #### My two cents This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error. The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time. But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer! |

Representation Learning by Rotating Your Faces

Tran, Luan and Yin, Xi and Liu, Xiaoming

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Tran, Luan and Yin, Xi and Liu, Xiaoming

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
This paper gets a face image and changes its pose or rotates it (to any desired pose) by passing the target pose as the input to the model. https://i.imgur.com/AGNOag5.png They use a GAN (named DR-GAN) for face rotation. The gan has an encoder and a decoder. The encoder takes the image and gets a high-level feature representation. The decoder gets high-level features, the target pose, and some noise to generate the output image with rotated face. The generated image is then passed to a discriminator where it says whether the image is real or fake. The disc also has two other outputs: 1- it estimates the pose of the generated image, 2) it estimated the identity of the person. no direct loss is applied to the generator, it is trained by the gradient that it gets through discriminator to minimize the three objects: 1- gan loss (to fool disc) 2-pose estimation 3- identity estimation. They use two tricks to improve the model: 1- using the same parameters for encoder of generator (gen-enc) and the discriminator (they observe this helps better identity recognition) 2- passing two images to gen-enc and interpolating between their high-level features (gen-enc output) and then applying two costs on it: 1) gan loss 2) pose loss. These losses are applied through disc, similar to above. The first trick improves gen-enc and second trick improves gen-dec, both help on identification. Their model can also leverage multiple image of the same identity if the dataset provides that to get better latent representation in gen-enc for a given identity. https://i.imgur.com/23Tckqc.png These are some samples on face frontalization: https://i.imgur.com/zmCODXe.png and these are some samples on interpolating different features in latent space: (sub-fig a) interpolating f(x) between the latent space of two images, (sub-fig b) interpolating pose (c), (sub-fig c) interpolating noise: https://i.imgur.com/KlkVyp9.png I find these positive aspects about the paper: 1) face rotation is applied on the images in the wild, 2) It is not required to have paired data. 3) multiple source images of the same identity can be used if provided, 4) identity and pose are used smartly in the discriminator to guide the generator, 5) model can specify the target pose (it is not only face-frontalization). Negative aspects: 1) face has many artifacts, similar to artifacts of some other gan models. 2) The identity is not well-preserved and the faces seem sometime distorted compared to the original person. They show the models performance on identity recognition and face rotation and demonstrate compelling results. |

About