Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Deep Residual Learning for Image Recognition

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identity mappings. In practice this does not seem to be easy for conventional networks, which get much worse as more layers are added. So the idea is to add identity connections which skip some layers. The network then only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
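To make the residual idea from the summary above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `residual_block` and `zero_residual` are hypothetical names, and vectors stand in for feature maps. It only illustrates that a block computing $x + F(x)$ reduces to the identity as soon as the residual branch outputs zeros.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return x + f(x): the identity shortcut plus the learned residual."""
    fx = f(x)
    if len(fx) != len(x):
        # The parameter-free identity shortcut requires matching shapes.
        raise ValueError("residual branch must preserve the shape of x")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x: Vector) -> Vector:
    """Hypothetical residual branch whose weights are all zero."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.5]
y = residual_block(x, zero_residual)  # the block behaves as the identity
```

With a zero residual branch the output equals the input, which is the sense in which "learning the identity becomes learning 0".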

Swapout: Learning an ensemble of deep architectures

Saurabh Singh and Derek Hoiem and David Forsyth

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV, cs.LG, cs.NE

**First published:** 2016/05/20 (8 years ago)

**Abstract:** We describe Swapout, a new stochastic training method, that outperforms
ResNets of identical network structure yielding impressive results on CIFAR-10
and CIFAR-100. Swapout samples from a rich set of architectures including
dropout, stochastic depth and residual architectures as special cases. When
viewed as a regularization method swapout not only inhibits co-adaptation of
units in a layer, similar to dropout, but also across network layers. We
conjecture that swapout achieves strong regularization by implicitly tying the
parameters across layers. When viewed as an ensemble training method, it
samples a much richer set of architectures than existing methods such as
dropout or stochastic depth. We propose a parameterization that reveals
connections to existing architectures and suggests a much richer set of
architectures to be explored. We show that our formulation suggests an
efficient training method and validate our conclusions on CIFAR-10 and
CIFAR-100 matching state of the art accuracy. Remarkably, our 32 layer wider
model performs similar to a 1001 layer ResNet model.

This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$, where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately to both terms of a layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$, where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term.

The paper shows that this form of dropout is at least as good as, or superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.

In addition to this observation, I also note the following empirical observations:

1. At test time, averaging the output layers of multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than the original ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper][2].

Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.

[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146
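The Swapout equation $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ from the summary above can be sketched for a single layer as follows. This is an illustration, not the paper's code: all function names are hypothetical, plain lists stand in for tensors, and the masks are assumed to be independent Bernoulli samples. `expected_swapout` shows the deterministic variant, where each mask is replaced by its expectation, and `linear_drop_rates` shows the linear 0 to 0.5 schedule over layers.

```python
import random
from typing import List

Vector = List[float]


def bernoulli_mask(n: int, keep_prob: float, rng: random.Random) -> Vector:
    """Sample n independent 0/1 entries, each equal to 1 with probability keep_prob."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]


def swapout(x: Vector, fx: Vector, keep_x: float, keep_f: float,
            rng: random.Random) -> Vector:
    """One stochastic Swapout sample: theta1 * x + theta2 * F(x), elementwise."""
    theta1 = bernoulli_mask(len(x), keep_x, rng)
    theta2 = bernoulli_mask(len(x), keep_f, rng)
    return [t1 * xi + t2 * fi for t1, xi, t2, fi in zip(theta1, x, theta2, fx)]


def expected_swapout(x: Vector, fx: Vector, keep_x: float, keep_f: float) -> Vector:
    """Deterministic inference: replace each mask by its expectation."""
    return [keep_x * xi + keep_f * fi for xi, fi in zip(x, fx)]


def linear_drop_rates(num_layers: int, final_rate: float = 0.5) -> List[float]:
    """Drop rate increasing linearly from 0 (first layer) to final_rate (last layer)."""
    if num_layers == 1:
        return [0.0]
    return [final_rate * i / (num_layers - 1) for i in range(num_layers)]


rng = random.Random(0)
sample = swapout([1.0, 2.0], [0.5, 0.5], keep_x=0.8, keep_f=0.8, rng=rng)
```

For a single layer like this, averaging many stochastic samples simply converges to `expected_swapout`; the two inference modes differ only once nonlinear layers are stacked, which this sketch does not model.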

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison and Andriy Mnih and Yee Whye Teh

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2016/11/02 (7 years ago)

**Abstract:** The reparameterization trick enables the optimization of large scale
stochastic computation graphs via gradient descent. The essence of the trick is
to refactor each stochastic node into a differentiable function of its
parameters and a random variable with fixed distribution. After refactoring,
the gradients of the loss propagated by the chain rule through the graph are
low variance unbiased estimators of the gradients of the expected loss. While
many continuous random variables have such reparameterizations, discrete random
variables lack continuous reparameterizations due to the discontinuous nature
of discrete states. In this work we introduce concrete random variables --
continuous relaxations of discrete random variables. The concrete distribution
is a new family of distributions with closed form densities and a simple
reparameterization. Whenever a discrete stochastic node of a computation graph
can be refactored into a one-hot bit representation that is treated
continuously, concrete stochastic nodes can be used with automatic
differentiation to produce low-variance biased gradients of objectives
(including objectives that depend on the log-likelihood of latent stochastic
nodes) on the corresponding discrete graph. We demonstrate their effectiveness
on density estimation and structured prediction tasks using neural networks.

This paper presents a way to differentiate through discrete random variables by replacing them with continuous random variables. Say you have a discrete [categorical variable][cat] and you're sampling it with the [Gumbel trick][gumbel] like this ($G_k$ is a Gumbel distributed variable and $\boldsymbol{\alpha}/\sum_k \alpha_k$ are our categorical probabilities):

$$
z = \text{one_hot} \left( \underset{k}{\text{arg max}} [ G_k + \log \alpha_k ] \right)
$$

This paper replaces the one hot and argmax with a softmax, and adds a $\lambda$ variable to control the "temperature". As $\lambda$ tends to zero, this approaches the equation above.

$$
z = \text{softmax} \left( \frac{ G_k + \log \alpha_k }{\lambda} \right)
$$

I made [some notes][nb] on how this process works, if you'd like more intuition.

Comparison to [Gumbel-softmax][gs]
--------------------------------------------

The two papers propose precisely the same distribution with notation changes ([noted there][gs]). Both papers also reference each other and the differences. They do, however, mention differences from Gumbel-softmax in the variational objectives.

This paper also compares to [VIMCO][], which is probably a harder benchmark to compare against (multi-sample versus single sample).

The results in both papers compare to SOTA score function based estimators and both report high scoring results (often the best). There are some details about implementations to consider though, such as scheduling and exactly how to define the variational objective.

[cat]: https://en.wikipedia.org/wiki/Categorical_distribution
[gumbel]: https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
[gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16
[nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748
[vimco]: https://arxiv.org/abs/1602.06725
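The two sampling equations in the summary above can be compared side by side with a small standard-library sketch. The function names are hypothetical and this is not the authors' code. Both functions take the same Gumbel noise as input, so the relaxed sample can be checked directly against the hard one-hot sample as the temperature $\lambda$ shrinks.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random) -> float:
    """Draw G = -log(-log(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return exactly 0, which would break the log
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_one_hot(log_alpha: List[float], noise: List[float]) -> List[float]:
    """Hard sample: one-hot vector at argmax_k [G_k + log alpha_k]."""
    scores = [g + la for g, la in zip(noise, log_alpha)]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return [1.0 if k == best else 0.0 for k in range(len(scores))]


def concrete_relaxation(log_alpha: List[float], noise: List[float],
                        temperature: float) -> List[float]:
    """Relaxed sample: softmax((G_k + log alpha_k) / temperature)."""
    scaled = [(g + la) / temperature for g, la in zip(noise, log_alpha)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
log_alpha = [math.log(a) for a in (1.0, 2.0, 3.0)]  # unnormalized alpha is fine
noise = [sample_gumbel(rng) for _ in log_alpha]
hard = gumbel_max_one_hot(log_alpha, noise)
soft = concrete_relaxation(log_alpha, noise, temperature=0.5)
```

The relaxed sample always lies on the probability simplex, and its largest entry sits at the same index as the hard sample's 1, since dividing by a positive temperature does not change the argmax.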

Neural Architecture Search with Reinforcement Learning

Barret Zoph and Quoc V. Le

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.AI, cs.NE

**First published:** 2016/11/05 (7 years ago)

**Abstract:** Neural networks are powerful and flexible models that work well for many
difficult learning tasks in image, speech and natural language understanding.
Despite their success, neural networks are still hard to design. In this paper,
we use a recurrent network to generate the model descriptions of neural
networks and train this RNN with reinforcement learning to maximize the
expected accuracy of the generated architectures on a validation set. On the
CIFAR-10 dataset, our method, starting from scratch, can design a novel network
architecture that rivals the best human-invented architecture in terms of test
set accuracy. Our CIFAR-10 model achieves a test error rate of 3.84, which is
only 0.1 percent worse and 1.2x faster than the current state-of-the-art model.
On the Penn Treebank dataset, our model can compose a novel recurrent cell that
outperforms the widely-used LSTM cell, and other state-of-the-art baselines.
Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is
3.6 perplexity better than the previous state-of-the-art.

### Main Idea:

It basically tunes the hyper-parameters of the neural network architecture using reinforcement learning. The reward signal is the performance on the validation set. The method is policy gradient, as the cost function is non-differentiable.

### Method:

#### i. Actions:

1. There is a controller RNN which predicts some hyper-parameters of a layer, conditioned on the previous predictions. Each prediction is fed back as a one-hot vector based on the previous hyper-parameter chosen. For the first prediction, this vector is all zeros.
2. Once the network is generated completely, it is trained for a fixed number of epochs, and the reward signal is calculated based on the evaluation on the validation set.

#### ii. Training:

1. Nothing fancy in the reinforcement learning approach: simple policy gradients.
2. A baseline is added to reduce the variance.
3. It takes 2-3 weeks to train it over 800 GPUs!!

#### iii. Results:

1. They use it to generate CNNs and LSTM cells. Close to state-of-the-art results with the generated architectures.

### Possible new directions:

1. Use better RL techniques like TRPO, PPO etc.
2. Right now, they generate fixed-length architectures. Their reason is that for variable-length architectures, it is difficult to determine how long each architecture should be trained. Smaller networks are easier to train. Thus, somehow determine the training time as a function of the learning capacity of the network.

Code: They haven't released the code yet. I tried to simulate it in torch: https://github.com/abhigenie92/nn_search
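The training loop described in the summary above (policy gradient with a baseline) can be illustrated with a deliberately simplified sketch. It is not the paper's method in full: the controller RNN is replaced by a single softmax over three hypothetical choices for one hyper-parameter, `toy_reward` is an invented stand-in for validation accuracy, and the baseline is assumed to be an exponential moving average of past rewards.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs: List[float], rng: random.Random) -> int:
    """Sample an index from a categorical distribution."""
    r, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def reinforce_step(logits: List[float], action: int, reward: float,
                   baseline: float, lr: float) -> List[float]:
    """Ascend (reward - baseline) * grad log pi(action) for a softmax policy."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [l + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


def toy_reward(action: int) -> float:
    """Hypothetical stand-in for the validation accuracy of a sampled choice."""
    return [0.2, 0.9, 0.5][action]


def train(steps: int = 2000, lr: float = 0.1, decay: float = 0.9,
          seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    logits, baseline = [0.0, 0.0, 0.0], 0.0
    for _ in range(steps):
        action = sample_index(softmax(logits), rng)
        reward = toy_reward(action)
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline = decay * baseline + (1.0 - decay) * reward  # moving average
    return softmax(logits)


final_probs = train()
```

Subtracting the baseline leaves the expected gradient unchanged but shrinks its variance: a choice is reinforced only when its reward beats the running average, rather than whenever the reward is positive.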

Learning by Asking Questions

Ishan Misra and Ross Girshick and Rob Fergus and Martial Hebert and Abhinav Gupta and Laurens van der Maaten

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV, cs.CL, cs.LG

**First published:** 2017/12/04 (6 years ago)

**Abstract:** We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our LBA
generated data consistently matches or outperforms the CLEVR train data and is
more sample efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test time distributions.

This paper is about an interactive Visual Question Answering (VQA) setting in which agents must ask questions about images in order to learn. This closely mimics how people learn from each other using natural language and has a strong potential to learn much faster with less data. It is referred to as learning by asking (LBA) throughout the paper. The approach is composed of three models:

http://imisra.github.io/projects/lba/approach_HQ.jpeg

1. **Question proposal module** is responsible for generating _important_ questions about the image. It is a combination of 2 models:
    - **Question generator** model produces a question. It is an LSTM that takes image features and a question type (a random choice from the available options) as input and outputs a question.
    - **Question relevance** model selects questions relevant to the image. It is a stacked attention architecture network (shown below) that takes in the generated question and image features and filters out questions irrelevant to the image.

    https://i.imgur.com/awPcvYz.png

2. **VQA module** learns to predict an answer given the image features and a question. It is implemented as the stacked attention architecture shown above.
3. **Question selection module** selects the most informative question to ask. It takes the current state of the VQA module and its output to calculate the expected accuracy improvement (details are in the paper), which measures how fast the VQA module has the potential to improve for each answer. The single question selection strategy (i.e. picking the best question for the VQA module to improve the fastest) is based on an epsilon-greedy policy.

This method (i.e. LBA) is shown to be about 50% more data efficient than the naive VQA method. As an interesting future direction of this work, the authors propose to use real-world images and include a human in the training as an answer provider.
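The epsilon-greedy rule mentioned for the question selection module can be sketched as below. This is a generic epsilon-greedy selector, not the paper's implementation: `epsilon_greedy_select` is a hypothetical name, and the `scores` are assumed to be the expected accuracy improvements already computed for each candidate question.

```python
import random
from typing import Sequence


def epsilon_greedy_select(scores: Sequence[float], epsilon: float,
                          rng: random.Random) -> int:
    """Return the index of the question to ask.

    With probability epsilon a uniformly random candidate is explored;
    otherwise the candidate with the highest score is exploited.
    """
    if not scores:
        raise ValueError("need at least one candidate question")
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical expected accuracy improvements for three candidate questions.
candidate_scores = [0.02, 0.15, 0.07]
chosen = epsilon_greedy_select(candidate_scores, epsilon=0.1, rng=random.Random(0))
```

The occasional random pick keeps the learner asking about question types whose estimated improvement is currently low, so those estimates do not go stale.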
