Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Adversarial Autoencoders

Makhzani, Alireza and Shlens, Jonathon and Jaitly, Navdeep and Goodfellow, Ian J.

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Makhzani, Alireza and Shlens, Jonathon and Jaitly, Navdeep and Goodfellow, Ian J.

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

[link]
#### Summary of this post: * an overview the motivation behind adversarial autoencoders and how they work * a discussion on whether the adversarial training is necessary in the first place. tl;dr: I think it's an overkill and I propose a simpler method along the lines of kernel moment matching. #### Adversarial Autoencoders Again, I recommend everyone interested to read the actual paper, but I'll attempt to give a high level overview the main ideas in the paper. I think the main figure from the paper does a pretty good job explaining how Adversarial Autoencoders are trained: ![](http://www.inference.vc/content/images/2016/01/Screen-Shot-2016-01-08-at-14-48-25.png) The top part of this image is a probabilistic autoencoder. Given the input $\mathbf{x}$, some latent code $\mathbf{z}$ is generated by sampling from an encoding distribution $q(\mathbf{z}\vert\mathbf{x})$. This distribution is typically modeled as the output a deep neural network. In normal autoencoders this encoder would be deterministic, now we allow it to be probabilistic. A decoder network is then trained to decode $\mathbf{z}$ and reconstruct the original input $\mathbf{x}$. Of course, reconstruction will not be perfect, but we train the networks to minimise reconstruction error, this is typically just mean squared error. The reconstruction cost ensures that the encoding process retains information about the input image, but it doesn't enforce anything else about what these latent representations $\mathbf{z}$ should do. In general, their distribution is described as the aggregate posterior $q(\mathbf{z})=\mathbb{E}_\mathbf{x} q(\mathbf{z}\vert\mathbf{x})$. Often, we would like this distribution to match a certain prior $p(\mathbf{z})$. For example. we may want $\mathbf{z}$ to have independent components and Gaussian distributed (nonlinear ICA,PCA). Or we may want to force the latent representations to correspond to discrete class labels, or binary factors. Or we may simply want to ensure there are 'no gaps' in the latent space, and any random $\mathbf{z}$ would lead to a viable sample when squashed through the decoder network. So there are multiple reasons why one might want to control the aggregate posterior $q(\mathbf{z})$ to match a predefined prior $p(\mathbf{z})$. The authors achieve this by introducing an additional term in the autoencoder loss function, one that measures the divergence between $q$ and $p$. The authors chose to do this via adversarial training: they train a discriminator network that constantly learns to discriminate between real code vectors $\mathbb{z}$ produced by encoding real data, and random code vectors sampled from $p$. If $q$ matches $p$ perfectly, the optimal discriminator network should have a large classification error. #### Is this an overkill? My main question about this paper was whether the adversarial cost is really needed here, because I think it's an overkill. Let me explain: Adversarial training is powerful when all else fails to quantify divergence between complicated, potentially degenerate distributions in high dimensions, such as images or video. Our toolkit for dealing with images is limited, CNNs are the best tool we have, so it makes sense to incorporate them in training generative models for images. GANs - when applied directly to images - are a great idea. However, here adversarial training is applied to an easier problem: to quantify the divergence between a simple, fixed prior (e.g. Gaussian) and an empirical distribution of latents. The latent space is usually lower-dimensional, distributions better behaved. Therefore, matching to $p(\mathbf{z})$ in latent space should be considerably easier than matching distributions over images. Adversarial training makes no assumptions about the distributions compared, other than sampling from them. This comes very handy when both $p$ and $q$ are nasty such as in the generative adversarial network scenario: there, $p$ is the distribution of natural images, $q$ is a super complicated, degenerate distribution produced by squashing noise through a deep convnet. The price we pay for this flexibility is this: when $p$ or $q$ are actually easy to work with, adversarial training cannot exploit that, it still has to sample. (it would be interesting to see if expectations over $p(\mathbf{z})$ could be computed analytically). So even though in this work $p$ is as simple as a mixture of ten 2D Gaussians, we need to approximate everything by drawing samples. #### Other things might work: kernel moment matching Why can’t one use easier divergences? For example, I think moment matching based on kernel MMD would work brilliantly in this scenario. It would have the following advantages over the adversarial cost. - closed form expressions: Depending on the choice of the prior $p(\mathbf{z})$ and kernel used in MMD, the expectations over $p$ may be available in closed form, without sampling. So for example if we use a squared exponential kernel and a mixture of Gaussians as $p$, the divergence from $p$ can be precomputed in closed form that is easy to evaluate. - no nasty inner loop: Adversarial training requires the discriminator network to be reoptimised every time the generative model changes. So we end up with a gradient descent in the inner loop of a gradient descent, which is anything but nice to work with. This is why it takes so long to get it working, the whole thing is pretty unstable. In contrast, to evaluate MMD, the inner loop is not needed. In fact, MMD can also be thought of as the solution to a convex maximisation problem, but via the kernel trick the maximum has a closed form solution. - the problem is well suited for MMD: because the distributions are smooth, and the space is nice and low-dimensional, MMD might work very well. Kernel-based methods struggle with complicated manifold-like structure of natural images, so I wouldn't expect MMD to be competitive with adversarial training if it is applied directly in the image space. Therefore, I actually prefer generative adversarial networks to generative moment matching networks. However, here we have an easier problem, simpler space, simpler distributions where MMD shines, and adversarial training is just not needed. |

Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art

Janai, Joel and Güney, Fatma and Behl, Aseem and Geiger, Andreas

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Janai, Joel and Güney, Fatma and Behl, Aseem and Geiger, Andreas

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
## Problems * **Computer Vision**: * Detection: Given a 2D image, where are cars, pedestrians, traffic signs? * Depth estimation: Given a 2D image, estimate the depth * **Planning**: Where do I want to go? * **Control**: How should I steer? ## Datasets * KITTI: Street segmentation (Computer Vision) * ISPRS * MOT * Cityscapes ## What I missed * GTSRB: The German Traffic Sign Recognition Benchmark dataset * GTSDB: The German Traffic Sign Detection Benchmark |

Fast R-CNN

Girshick, Ross B.

International Conference on Computer Vision - 2015 via Local Bibsonomy

Keywords: dblp

Girshick, Ross B.

International Conference on Computer Vision - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14} 1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training. 2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being set an an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell." 3. Backprop is performed for the RoI pooling layer by taking the argmax of the incoming gradients that overlap the incoming values. This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15} |

Adam: A Method for Stochastic Optimization

Kingma, Diederik P. and Ba, Jimmy

arXiv e-Print archive - 2014 via Local Bibsonomy

Keywords: dblp

Kingma, Diederik P. and Ba, Jimmy

arXiv e-Print archive - 2014 via Local Bibsonomy

Keywords: dblp

[link]
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp. * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function. * A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout. * Their method tends to converge faster to optimal parameters than the existing competitors. * Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp). * Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad). ### How * Basic principle * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`. * Adam operates similar to that, but adds more "cleverness" to the rule. * It assumes that the gradient values have means and variances and tries to estimate these values. * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients. * The mean is also called "the first moment". * The variance is also called "the second (raw) moment". * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`. * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`. * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'. * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changend. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`. * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`. * Exponential moving averages * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values. * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs). * A simple average also has the disadvantage, that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients. * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * avg`. * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using: * `mean = beta1 * mean + (1 - beta1) * g` * `variance = beta2 * variance + (1 - beta2) * g^2`. * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`. * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors. * Bias correction * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced. * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.9 * g`. So `0.9g`, not `g`. Both the mean and the variance are biased (towards 0). * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit. * So to fix this pretty they perform bias-corrections of the mean and the variance: * `correctedMean = mean / (1-beta1^t)` (where `t` is the timestep). * `correctedVariance = variance / (1-beta2^t)`. * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep). ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adam__algorithm.png?raw=true "Algorithm") |

Building Machines That Learn and Think Like People

Lake, Brenden M. and Ullman, Tomer D. and Tenenbaum, Joshua B. and Gershman, Samuel J.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Lake, Brenden M. and Ullman, Tomer D. and Tenenbaum, Joshua B. and Gershman, Samuel J.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

[link]
TLDR; The author explore the gap between Deep Learning methods and human learning. The argue that natural intelligence is still the best example of intelligence, so it's worth exploring. To demonstrate their points they explore two challenges: 1. Recognizing new characters and objects 2. Learning to play the game Frostbite. The authors make several arguments: - Humans have an intuitive understanding of physics and psychology (understanding goals and agents) very early on. These two types of "software" help them to learn new tasks quickly. - Humans build causal models of the world instead of just performing pattern recognition. These models allow humans to learn from far fewer examples than current Deep Learning methods. For example, AlphaGo played a billion games or so, Lee Sedol perhaps 50,000. Incorporating compositionality, learning-to-learn (transfer learning) and causality helps humans to build these models. - Humans use both model-free and model-based learning algorithms. |

About