Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful for obtaining another reader's perspective and insight: why they liked or disliked a paper, and their attempt to demystify its complicated sections.
- Writing summaries is also a good exercise for understanding the content of a paper, because explaining it forces you to challenge your assumptions.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Latent Predictor Networks for Code Generation

Ling, Wang and Grefenstette, Edward and Hermann, Karl Moritz and Kociský, Tomás and Senior, Andrew and Wang, Fumin and Blunsom, Phil

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp


This paper presents a conditional generative model of text, where text can be generated either one character at a time or by copying full chunks of characters taken directly from the input. At each step of generation, the model decides which of these two modes to use, mixing them as needed to produce a correct output. The authors refer to this structure as Latent Predictor Networks \cite{conf/nips/VinyalsFJ15}. The character-level component of the model is a simple output softmax over characters, while the copy component is based on a Pointer Network architecture. Critically, the authors highlight that it is possible to marginalize over the use of either type of component by dynamic programming, as used in semi-Markov models \cite{conf/nips/SarawagiC04}.

One motivating application is machine translation, where the input might contain named entities that should simply be copied directly to the output. However, the authors experiment on a different problem: generating code that implements the action of a card in the trading card games Magic the Gathering and Hearthstone. In this application, copying is useful for things such as the name of the card or its numerically-valued effects. In addition to the Latent Predictor Network structure, the proposed model for this application includes a slightly adapted form of soft attention as well as character-aware word embeddings as in \cite{conf/emnlp/LingDBTFAML15}. The authors also experiment with a compression procedure on the target programs, which helps reduce the size of the output space. Experiments show that the proposed neural network approach outperforms a variety of strong baselines, including systems based on machine translation or information retrieval.
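The per-step mixture of a character softmax and a pointer (copy) distribution can be sketched as follows. This is only an illustrative numpy sketch, not the authors' implementation: the paper copies full chunks rather than single characters, and all function and variable names here are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def step_distribution(char_logits, copy_logits, mode_logits, input_chars):
    """Mix a character softmax with a pointer (copy) distribution.

    mode_logits: 2 scores (generate vs. copy); copy_logits: one score per
    input position. Copy probability mass is scattered onto the vocabulary
    entries of the corresponding input characters.
    """
    p_mode = softmax(mode_logits)          # P(generate), P(copy)
    p_char = softmax(char_logits)          # distribution over the vocabulary
    p_ptr = softmax(copy_logits)           # distribution over input positions
    mixed = p_mode[0] * p_char
    for pos, c in enumerate(input_chars):  # add copy mass to copied chars
        mixed[c] += p_mode[1] * p_ptr[pos]
    return mixed
```

The marginalization over modes happens implicitly here: the output is a single valid distribution over the vocabulary, so characters that appear in the input receive probability from both generation paths.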

Deep Networks with Stochastic Depth

Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: deeplearning, acreuser


**Dropout for layers** sums it up pretty well. The authors build on the idea of [deep residual networks](http://arxiv.org/abs/1512.03385), using identity functions to skip entire layers at training time. The main advantages:

* Training speed-ups of about 25%
* Very deep networks without overfitting

## Evaluation

* [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html): 4.91% error ([SotA](https://martin-thoma.com/sota/#image-classification): 2.72%). Training time: ~15h
* [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html): 24.58% error ([SotA](https://martin-thoma.com/sota/#image-classification): 17.18%). Training time: < 16h
* [SVHN](http://ufldl.stanford.edu/housenumbers/): 1.75% error ([SotA](https://martin-thoma.com/sota/#image-classification): 1.59%), trained for 50 epochs, beginning with a learning rate of 0.1, divided by 10 after epochs 30 and 35. Training time: < 26h
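The "dropout for layers" idea can be sketched in a few lines: during training each residual branch survives with probability $p_l$; at test time its output is scaled by $p_l$ instead. This is a minimal sketch assuming scalar inputs and toy blocks, not the paper's implementation (which decays $p_l$ linearly with depth inside a ResNet).

```python
import random

def stochastic_depth_forward(x, blocks, survival_probs, training=True):
    """Residual forward pass where each block is randomly skipped.

    blocks: list of callables f_l implementing the residual branches.
    survival_probs: per-block probability p_l of keeping the branch.
    """
    for f, p in zip(blocks, survival_probs):
        if training:
            if random.random() < p:
                x = x + f(x)    # block survives: identity plus residual
            # otherwise: identity skip only, the block sees no gradient
        else:
            x = x + p * f(x)    # test time: expected value of the branch
    return x
```

With all survival probabilities at 1 this reduces to a plain residual network; lowering them shortens the expected depth during training, which is where the speed-up comes from.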

SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition

Camgöz, Necati Cihan and Hadfield, Simon and Koller, Oscar and Bowden, Richard

International Conference on Computer Vision - 2017 via Local Bibsonomy

Keywords: dblp


This paper tackles the challenging task of hand shape and continuous Sign Language Recognition (SLR) directly from images obtained with a common RGB camera (rather than from motion sensors like Kinect). The basic idea is to create a network that is end-to-end trainable from input (images) to output (hand shape labels, word labels) sequences. The network is composed of three parts:

- a CNN as a feature extractor
- bidirectional LSTMs for temporal modeling
- Connectionist Temporal Classification (CTC) as a loss layer

![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png)

Results:

- State-of-the-art results (at the time of publication) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Using full images rather than hand patches gives better performance for continuous SLR.
- A network that recognizes hand shapes and a network that recognizes word sequences can be combined and trained together to recognize word sequences. Fine-tuning the combined system across all layers works better than freezing the "feature extraction" layers.
- Combining two networks, each trained on a separate task, performs slightly better than training each network on word sequences.
- Only marginal differences in performance were observed across decoding and post-processing techniques for sequence-to-sequence prediction.
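The CTC layer mentioned above maps per-frame predictions to a label sequence by merging repeated labels and dropping blanks. A minimal greedy-decoding sketch (illustrative only; the paper also evaluates other decoding schemes):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse per-frame argmax labels into an output sequence:
    merge consecutive repeats, then drop blanks (standard CTC collapse)."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can emit the same word (or character) twice in a row.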

Fast Instance and Semantic Segmentation Exploiting Local Connectivity, Metric Learning, and One-Shot Detection for Robotics

Milioto, Andres and Mandtler, Leonard P. and Stachniss, Cyrill

International Conference on Robotics and Automation - 2019 via Local Bibsonomy

Keywords: dblp


The paper proposes a method for joint instance and semantic segmentation. The method is fast because it is meant to run in an embedded environment (such as a robot). While the semantic map may seem redundant given the instance map, it is not: semantic segmentation is a key step in obtaining the instance map.

# Architecture

![image](https://user-images.githubusercontent.com/8659132/63187959-24cdb380-c02e-11e9-9121-77e0923e91c6.png)

The image is first put through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The decoders output at low resolution for faster processing:

- Semantic segmentation: coupled with the encoder, it is U-Net-like. The output is a segmentation map.
- Instance center: for each pixel, outputs the confidence that it is the center of an object.
- Embedding: for each pixel, computes a 32-dimensional embedding. This embedding must have a low distance to the embeddings of other pixels of the same instance, and a high distance to the embeddings of pixels of other instances.

To obtain the instance map, the segmentation map is used to mask the other two decoder outputs, separating the embeddings and centers of each class. Centers are thresholded at 0.7, and centers whose embedding distance to another center is lower than a set amount are discarded as duplicates. Then, for each class, a similarity matrix is computed between all pixels of that class and the centers of that class. Pixels are assigned to their closest center, with each center representing a different instance of the class. Finally, the segmentation and instance maps are upsampled using the SLIC algorithm.

# Loss

There is one loss per decoder head:

- Semantic segmentation: weighted cross-entropy.
- Instance center: a cross-entropy term modulated by a $\gamma$ parameter to counter the over-representation of the background relative to the target classes.

![image](https://user-images.githubusercontent.com/8659132/63286485-22659680-c286-11e9-9134-f1b823a34217.png)

- Embedding: composed of 3 parts, an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and an L2 regularization on the embedding.

![image](https://user-images.githubusercontent.com/8659132/63286399-f1856180-c285-11e9-9136-feb6c4a555e5.png)
![image](https://user-images.githubusercontent.com/8659132/63286411-fcd88d00-c285-11e9-939f-0771579d8263.png)

$\hat{e}$ are the embeddings, $\delta_a$ is a hyper-parameter defining "close enough", and $\delta_b$ defines "far enough". The whole model is trained jointly using a weighted sum of the 3 losses.

# Experiments and results

The authors test their method on the Cityscapes dataset, which is composed of 5000 annotated images and 8 instance classes. They compare their method on both semantic segmentation and instance segmentation.

![image](https://user-images.githubusercontent.com/8659132/63287573-a882dc80-c288-11e9-83e0-b352e43bdf28.png)

For semantic segmentation, their method is adequate, though ENet, for example, performs better on average and is much faster.

![image](https://user-images.githubusercontent.com/8659132/63287643-d700b780-c288-11e9-9d40-5bcaf695a744.png)

On the other hand, for instance segmentation, their method is much faster than the others while still performing well. It is not state of the art on accuracy, but given the real-time constraint it does much better.

# Comments

- Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.
- If the aggressive down/upsampling were removed, I wonder whether it would beat Mask R-CNN and PANet.
- I'm not sure what the point is of upsampling the semantic map, given that we already have the instance map.
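The pixel-to-center assignment at the heart of the instance decoder can be sketched with numpy. This is an illustrative sketch under assumed names and a hypothetical `max_dist` threshold, not the authors' code:

```python
import numpy as np

def assign_instances(embeddings, center_embeddings, max_dist=1.0):
    """Assign each pixel embedding to its nearest instance center.

    embeddings: (N, D) per-pixel embeddings of one semantic class.
    center_embeddings: (K, D) embeddings at the thresholded center pixels.
    Returns an instance id per pixel, or -1 if no center is close enough.
    """
    # (N, K) pairwise distances between pixel and center embeddings
    d = np.linalg.norm(
        embeddings[:, None, :] - center_embeddings[None, :, :], axis=2
    )
    ids = d.argmin(axis=1)               # closest center per pixel
    ids[d.min(axis=1) > max_dist] = -1   # leave distant pixels unassigned
    return ids
```

Running this once per semantic class (after masking with the segmentation map) yields the low-resolution instance map that is then upsampled with SLIC.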

Variational inference for Monte Carlo objectives

Mnih, Andriy and Rezende, Danilo Jimenez

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Mnih, Andriy and Rezende, Danilo Jimenez

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

This paper explores the use of so-called Monte Carlo objectives for training directed generative models with latent variables. Monte Carlo objectives take the form of the logarithm of a Monte Carlo estimate (i.e. an average over samples) of the marginal probability $P(x)$. One important motivation for using Monte Carlo objectives is that they can be shown (see the Importance Weighted Variational Autoencoder paper \cite{journals/corr/BurdaGS15} and my notes on it) to correspond to bounds on the true likelihood of the model, and the bound can be tightened simply by drawing more samples in the Monte Carlo objective. Currently, the most successful application of Monte Carlo objectives is based on an importance sampling estimate, which involves training a proposal distribution $Q(h|x)$ in addition to the model $P(x,h)$.

This paper considers the problem of training such objectives with gradient descent, in models where the reparametrization trick cannot be used (e.g. with discrete latent variables). The authors analyze the sources of variance in the estimation of the gradients (see Equation 5) and propose a very simple approach to reducing the variance of a sampling-based estimator of these gradients. First, they argue that gradients with respect to the parameters of $P(x,h)$ are less susceptible to problems of high variance. Second, and most importantly, they derive a multi-sample estimate of the gradient meant to reduce the variance of gradients on the parameters of the proposal distribution $Q(h|x)$. The end result is the gradient estimate of Equations 10-11. It is based on the observation that the first term of the gradient of Equation 5 does not distinguish between the contributions of the individual sampled latents $h_i$. The key contribution is this: they notice that one can incorporate a variance-reducing baseline for each sample $h_i$, corresponding to the Monte Carlo estimate of the log-likelihood with $h_i$ removed from the estimate (see Equation 10).

The authors show that this is a proper baseline, in that using it does not introduce bias into the gradient estimates. Experiments show that this approach yields better performance than training based on Reweighted Wake-Sleep \cite{journals/corr/BornscheinB14} or on NVIL baselines \cite{conf/icml/MnihG14}, when training sigmoid belief networks as generative models or as structured output prediction (image completion) models on binarized MNIST.
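The leave-one-out baseline can be sketched numerically. This is a hedged numpy sketch of my reading of the idea, not the paper's code: for each sample $h_i$, the held-out log weight is replaced by the arithmetic mean of the other log weights (a geometric mean in weight space), and the resulting re-estimated bound serves as the baseline for that sample's learning signal.

```python
import numpy as np

def vimco_learning_signals(log_w):
    """Per-sample learning signals with leave-one-out baselines.

    log_w: length-K array of log importance weights log[P(x,h_i)/Q(h_i|x)].
    Returns the centered signal (bound minus baseline) for each sample.
    """
    def log_mean_exp(v):
        m = v.max()                        # subtract max for stability
        return m + np.log(np.mean(np.exp(v - m)))

    K = len(log_w)
    L = log_mean_exp(log_w)                # multi-sample bound estimate
    signals = np.empty(K)
    for i in range(K):
        others = np.delete(log_w, i)
        # stand in for log w_i with the mean of the other log weights
        replaced = np.append(others, others.mean())
        signals[i] = L - log_mean_exp(replaced)
    return signals
```

When all weights are equal the signals vanish, which matches the intuition that no single sample then deserves more credit than the others.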
