# Deep Residual Learning for Image Recognition

Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identities. In practice this is not so easy for conventional networks, which get much worse with more layers. The idea is therefore to add identity connections that skip some layers, so that the network only has to learn the **residuals** (a toy sketch of a residual block follows this summary). Advantages:

* Learning the identity becomes learning 0, which is simpler.
* Loss of information flow in the forward pass is no longer a problem.
* No vanishing / exploding gradients.
* Identities don't have parameters to be learned.

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus, with a weight decay of $10^{-4}$, a momentum of 0.9, and mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
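To make the residual idea concrete, here is a minimal NumPy sketch of a fully connected residual block. The paper uses convolutional layers; the dense layers, shapes, and names here are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two-layer residual block: the layers learn F(x), the output is F(x) + x.

    If the optimal mapping is the identity, the weights only have to drive
    F(x) toward zero, which is easier than learning the identity directly.
    """
    f = relu(x @ w1)    # first layer + nonlinearity
    f = f @ w2          # second layer (pre-activation of the residual)
    return relu(f + x)  # identity shortcut: add the input back, then activate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                # batch of 4, width 16
w1 = rng.normal(scale=0.1, size=(16, 16))
w2 = rng.normal(scale=0.1, size=(16, 16))
print(residual_block(x, w1, w2).shape)      # (4, 16)
```

Note that the shortcut adds nothing to the parameter count: if `w1` and `w2` go to zero, the block collapses to the identity for free.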
# Bayesian Compression for Deep Learning

This paper describes an algorithm for parametrically adding noise and applying a variational regulariser, similar to that in ["Variational Dropout Sparsifies Deep Neural Networks"][vardrop]. Both have the same goal: make neural networks more efficient by removing parameters (and therefore the computation applied with those parameters). This paper additionally aims to prescribe how many bits each parameter should be stored with.

There is a very nice derivation of the hierarchical variational approximation being used here, which I won't try to replicate. In practice, the difference to prior work is that the stochastic gradient variational method uses hierarchical samples; i.e., it samples from a prior, then incorporates these samples when sampling over the weights (both applied through local reparameterization tricks; a sketch of the basic trick follows this summary).

It's a powerful method that allows the authors to test two different priors (although they are clearly not limited to just these) and compare both against competing methods. The results are comparable, and the choice of prior offers some tradeoffs between sparsity and quantization.

[vardrop]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1701.05369
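The paper applies reparameterization at two levels (prior scales and weights); as a rough illustration of the basic local reparameterization trick alone, here is a NumPy sketch for a dense layer with a factorized Gaussian posterior over weights. All names and shapes are illustrative, not the paper's.

```python
import numpy as np

def local_reparam_dense(a, w_mu, w_logvar, rng):
    """Local reparameterization trick for a dense layer with posterior
    q(W) = N(w_mu, exp(w_logvar)), factorized per weight.

    Instead of sampling a weight matrix per example, sample the layer's
    pre-activations directly: for b = a @ W,
        mean(b) = a @ w_mu,   var(b) = a**2 @ exp(w_logvar).
    This gives lower-variance gradient estimates at the same cost.
    """
    b_mu = a @ w_mu
    b_var = (a ** 2) @ np.exp(w_logvar)
    eps = rng.standard_normal(b_mu.shape)
    return b_mu + np.sqrt(b_var) * eps

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 32))                    # batch of input activations
w_mu = rng.normal(scale=0.1, size=(32, 16))     # posterior means
w_logvar = np.full((32, 16), -6.0)              # small posterior variances
print(local_reparam_dense(a, w_mu, w_logvar, rng).shape)  # (8, 16)
```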
# DRAW: A Recurrent Neural Network For Image Generation

The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach to image generation produces images that can't be distinguished from the training data.

#### What is DRAW?

The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region modified by the decoder. A toy sketch of the resulting generative loop follows this summary.

#### What do we gain?

The resulting images are greatly improved by allowing conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the "Where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
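As a rough illustration of the sequential generation only (ignoring the attention window and the encoder entirely), here is a toy NumPy sketch of a DRAW-style canvas loop: each step's decoder output is added onto a running canvas, and the final image is the sigmoid of the accumulated canvas. All dimensions and weight names are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
T, z_dim, h_dim, img_dim = 10, 8, 32, 28 * 28
W_h = rng.normal(scale=0.1, size=(z_dim + h_dim, h_dim))  # decoder recurrence
W_write = rng.normal(scale=0.1, size=(h_dim, img_dim))    # write projection

canvas = np.zeros(img_dim)
h = np.zeros(h_dim)
for t in range(T):
    z = rng.standard_normal(z_dim)             # latent sample for this step
    h = np.tanh(np.concatenate([z, h]) @ W_h)  # decoder RNN state update
    canvas += h @ W_write                      # additive write onto the canvas

image = sigmoid(canvas)                        # final generated image
print(image.shape)                             # (784,)
```

In the real model the write is routed through the spatial attention window, so each step only modifies a small patch of the canvas.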
# Adaptive Computation Time for Recurrent Neural Networks

This paper proposes a neural architecture that allows backpropagating gradients through a procedure that can go through a variable and adaptive number of iterations. These "iterations" could, for instance, be the number of times computations are passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper.

This is essentially achieved by pooling the recurrent states and respective outputs computed by each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units: sigmoidal units computed at each iteration which give a soft weight on whether the computation should stop at the current iteration. Crucially, the paper introduces a new ponder cost $P(x)$, a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below; a sketch of the mechanism follows this summary).

The paper presents experiments on RNNs applied to sequences where, at each time step $t$ (not to be confused with what I'm calling computation iterations, which are indexed by $n$), the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled to produce a single recurrent state and output for time step $t$. During each of the $N(t)$ iterations at time step $t$, the intermediate states are connected to the same time-step-$t$ input. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves on to the next time step $t+1$. To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration.

Results are presented on a variety of synthetic problems and a character prediction problem.
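Here is a minimal NumPy sketch of the halting-and-pooling mechanism for a single time step, following the description above: halting probabilities accumulate until they would exceed $1 - \epsilon$, the final iteration receives the remainder $R$ so the weights sum to 1, and the ponder cost is $N + R$. The epsilon value, shapes, and names are illustrative.

```python
import numpy as np

def act_pool(states, halt_logits, eps=0.01):
    """Adaptive-computation-time pooling over intermediate states.

    Halting probability p_n = sigmoid(halt_logits[n]); computation stops at
    the first iteration N where the cumulative sum would reach 1 - eps. The
    last iteration gets the remainder R so the weights sum to 1; the pooled
    state is the weighted mean, and the ponder cost for this step is N + R.
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(halt_logits)))
    weights = np.zeros_like(p)
    total = 0.0
    for n, p_n in enumerate(p):
        if total + p_n >= 1.0 - eps or n == len(p) - 1:
            weights[n] = 1.0 - total        # remainder R
            N, R = n + 1, 1.0 - total
            break
        weights[n] = p_n
        total += p_n
    pooled = weights[:N] @ states[:N]       # soft-pooled recurrent state
    ponder_cost = N + R
    return pooled, ponder_cost

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))            # up to 5 intermediate states
pooled, cost = act_pool(states, halt_logits=[-2.0, 0.5, 1.0, 1.0, 1.0])
print(pooled.shape, round(float(cost), 3))  # (8,) and N + R
```

Because the weights are soft, gradients flow into the halting units, and penalizing $N + R$ pushes the network to halt earlier when extra iterations don't help.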
# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN) - a topologically constrained variant of the GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train.
* Very useful for learning unsupervised image representations.

## Model

* GANs are difficult to scale using CNNs.
* The paper proposes the following changes to GANs:
    * Replace any pooling layers with strided convolutions (in the discriminator) and fractionally strided convolutions (in the generator).
    * Remove fully connected hidden layers.
    * Use batch normalisation in both the generator (all layers except the output layer) and the discriminator (all layers except the input layer).
    * Use LeakyReLU in all layers of the discriminator.
    * Use ReLU activation in all layers of the generator (except the output layer, which uses Tanh).

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with a minibatch size of 128.
* Weights initialized from a zero-centered Normal distribution with standard deviation 0.02.
* Adam optimizer.
* Slope of the leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5.

## Observations

* Large-Scale Scene Understanding data:
    * Demonstrates that the model scales with more data and higher-resolution generation.
    * It is unlikely that the model memorized images (due to the low learning rate of minibatch SGD).
* Classifying the CIFAR-10 dataset:
    * Features:
        * Train on Imagenet-1K and test on CIFAR-10.
        * Max-pool the discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
        * Flatten and concatenate to get a 28672-dimensional vector.
        * A linear L2-SVM classifier is trained over the feature vector.
    * 82.8% accuracy, outperforming K-means (80.6%).
* Street View House Numbers classifier:
    * Similar pipeline as CIFAR-10.
    * 22.48% test error.
* The paper contains many examples of images generated by the final and intermediate layers of the network.
* Walking in the latent space does not show sharp transitions, indicating that the network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* The network seems to have some success in disentangling image representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually (see the sketch after this list).
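The vector-arithmetic observation amounts to a few lines of NumPy. The vectors below are random stand-ins for the averaged Z vectors the paper uses, and the trained generator `G` is assumed rather than implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 100  # DCGAN samples Z from a 100-dimensional latent space

# Stand-ins for the average Z vector over several samples per attribute;
# in the paper these come from exemplar faces with the named attribute.
z_smiling_woman = rng.normal(size=z_dim)
z_neutral_woman = rng.normal(size=z_dim)
z_neutral_man = rng.normal(size=z_dim)

# smiling woman - neutral woman + neutral man ~ smiling man
z_smiling_man = z_smiling_woman - z_neutral_woman + z_neutral_man

# A real experiment would decode the result with the trained generator:
# image = G(z_smiling_man)
print(z_smiling_man.shape)  # (100,)
```

Averaging over several exemplars before the arithmetic matters: single Z vectors are too noisy for the attribute directions to emerge.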