[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/)

### Feature matching

Problem: the generator overtrains on the current discriminator.

Solution: train the generator to minimize $\left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p_{z}(z)} f(G(z)) \right\|_{2}^{2}$, where $f(x)$ denotes the activations of an intermediate layer of the discriminator.

### Minibatch discrimination

Problem: the generator collapses to a single point.

Solution: for each sample $i$, concatenate to $f(x_i)$ features $b$ measuring its distance to the other samples $j$ (where $i$ and $j$ are both real or both generated samples in the same batch): $\sum_j \exp(-\|M_{i, b} - M_{j, b}\|_{L_1})$. This generates visually appealing samples very quickly.

### Historical averaging

Problem: SGD fails by going into extended orbits.

Solution: make the parameters revert to their historical mean by adding the penalty $\left\| \theta - \frac{1}{t} \sum_{i=1}^t \theta[i] \right\|^2$.

### One-sided label smoothing

Problem: the discriminator is vulnerable to adversarial examples.

Solution: the discriminator target for positive samples is 0.9 instead of 1.

### Virtual batch normalization

Problem: batch normalization makes the output for one example depend on the other examples in the batch.

Solution: use a reference batch chosen once at the start of training; each sample is normalized using itself and the reference batch. This is expensive, so it is used only in the generator.

### Assessment of image quality

Problem: human evaluation on MTurk is not reliable.

Solution: use the Inception model $p(y \mid x)$ to compute $\exp(\mathbb{E}_x \text{KL}(p(y \mid x) \,\|\, p(y)))$ on 50K generated images $x$.

### Semi-supervised learning

Use the discriminator to also classify into $K$ labels when they are known, and use all real samples (labeled and unlabeled) in the discrimination task: $D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
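The feature matching objective and the Inception-based score from this summary are compact enough to sketch directly. What follows is a minimal pure-Python illustration, not the authors' implementation: the helper names `feature_matching_loss` and `inception_score` are hypothetical, features and class probabilities are passed as plain lists of lists, and the toy inputs are made up.

```python
import math


def feature_matching_loss(real_feats, fake_feats):
    """Squared L2 distance between the mean feature vectors of two batches.

    real_feats / fake_feats: lists of feature vectors f(x) and f(G(z)),
    standing in for the activations of an intermediate discriminator layer.
    """
    dim = len(real_feats[0])
    mean_real = [sum(f[d] for f in real_feats) / len(real_feats) for d in range(dim)]
    mean_fake = [sum(f[d] for f in fake_feats) / len(fake_feats) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_fake))


def inception_score(cond_probs):
    """exp(E_x KL(p(y|x) || p(y))) for a list of p(y|x) rows.

    cond_probs: one row of class probabilities per generated image.
    p(y) is estimated as the average of the rows.
    """
    n = len(cond_probs)
    k = len(cond_probs[0])
    marginal = [sum(row[c] for row in cond_probs) / n for c in range(k)]
    mean_kl = 0.0
    for row in cond_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        mean_kl += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0.0)
    return math.exp(mean_kl / n)


# Toy inputs (made up for illustration).
fm = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])

# Uninformative predictions: every image gets the same uniform p(y|x).
score_uniform = inception_score([[0.25, 0.25, 0.25, 0.25]] * 8)

# Confident and diverse predictions: one-hot rows covering all 4 classes.
score_sharp = inception_score([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
```

On these toy inputs the score sits at its minimum of 1 when every prediction is uniform, and reaches the number of classes when predictions are both confident and evenly spread, which is the behaviour the measure is meant to reward.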

[link]
This paper introduces a neural network architecture that is deeper and wider, yet optimized for computational efficiency: it approximates the expected sparse structure (following from Arora et al.'s work) using readily available dense blocks. An ensemble of 7 models (all with the same architecture but different image sampling) achieved the top spot in the classification task at ILSVRC 2014.

On Arora et al.'s work, the paper says:

> Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.

Main contributions:

* A more generalized exploration of the Network in Network (NIN) architecture, called the Inception module.
    * 1x1 convolutions capture dense information clusters.
    * 3x3 and 5x5 convolutions capture more spatially spread out clusters.
    * The ratio of 3x3 and 5x5 to 1x1 convolutions increases as we go deeper, since features of higher abstraction are less spatially concentrated.
    * To avoid the blow-up of output channels caused by merging the outputs of the convolutional layers and the pooling layer, they use 1x1 convolutions for dimensionality reduction. This has the added benefit of another layer of nonlinearity (and thus increased discriminative capability).
* Multiple intermediate layers are tied to the objective function. Since features produced by intermediate layers of a deep network are supposed to be very discriminative, and to strengthen the gradient signal passing through them during backpropagation, they attach auxiliary classifiers to intermediate layers.
    * During training, the auxiliary losses are added to the total loss of the network as a weighted sum.
    * At test time, these auxiliary networks are discarded.
    * Architecture: average pooling, 1x1 convolution (for dimensionality reduction), dropout, linear layer with softmax.

## Strengths

* Excellent results on ILSVRC 2014.

## Weaknesses / Notes

* Even though the authors try to explain some of the intuition, most of the design decisions seem arbitrary.
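To make the dimensionality-reduction point concrete, here is some rough weight-count arithmetic for a single 5x5 branch, with and without a 1x1 reduction in front of it. The channel counts (192 in, 16 after reduction, 32 out) are illustrative assumptions chosen for this sketch, biases are ignored, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Illustrative channel counts (assumed for this sketch).
in_channels = 192
reduced_channels = 16
out_channels = 32

# 5x5 convolution applied directly to the full input depth.
direct = conv_weights(5, in_channels, out_channels)

# 1x1 reduction first, then the 5x5 convolution on the thinner tensor.
reduced = (conv_weights(1, in_channels, reduced_channels)
           + conv_weights(5, reduced_channels, out_channels))

print(direct, reduced, round(direct / reduced, 2))
```

With these numbers the reduced branch needs roughly a tenth of the weights of the direct one, which is what keeps the wider, deeper network affordable.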
[link]
(See also a more thorough summary in [a LaTeX PDF][1].)

This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure.

We can agree on some deficiencies of maximum likelihood learning:

* ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the ground-truth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.
* ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are non-differentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest.

Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize non-differentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:

* Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80-token sequences where each word is drawn from a vocabulary of 80,000 words).
* The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the $80{,}000^{80}$ possible outputs in the translation problem in the paper.
* Vanilla RL doesn't take into account the ground-truth outputs available to us in structured prediction.

This paper designs a solution which combines supervised learning with a reinforcement-learning-inspired smoothing method.

Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to high-reward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the ground-truth outputs in the supervised data. We end up optimizing the following objective:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$

This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the non-stationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$

### Model details

If you're interested in the low-level details, I wrote up the gist of the math in [this PDF][1].

### Analysis

#### Relationship to label smoothing

This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In next-word prediction language modeling, a popular trick involves smoothing the target distributions by combining the ground-truth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the one-hot vector from our supervised data and a normalized frequency vector calculated on some corpus.)

Mathematically, the cross entropy with label smoothing is

$$\mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$

(The equation above leaves out a constant entropy term.)

The gradient of this objective has exactly the same form as the reward-augmented ML gradient from the paper:

$$\nabla_\theta \mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}(y; y^*)} \left[ \nabla_\theta \log p_\theta(y \mid x) \right]$$

So reward-augmented likelihood is equivalent to label smoothing, where our smoothing distribution is log-proportional to our downstream reward function.

#### Relationship to distillation

Optimizing the reward-augmented maximum likelihood is equivalent to minimizing the KL divergence

$$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$

This divergence reaches zero iff $q = p_\theta$. We can say, then, that the effect of optimizing $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$).

It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit efficient sampling.

#### Relationship to adversarial training

We can also view reward-augmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function.

Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the ground-truth supervised data. A high-level comparison:

* Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.
* Adversarial training is a **model-based data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **data-based augmentation**, which could in theory be done independently of the actual training process.

Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary.

[1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view
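The exponentiated payoff distribution $q(y \mid y^*; \tau)$ and the RML objective from this summary can be illustrated on an output space small enough to enumerate. The sketch below is a toy under stated assumptions, not the paper's implementation: the reward is taken to be negative Hamming distance to the ground truth, the vocabulary has three made-up symbols, and `payoff_distribution` and `rml_loss` are hypothetical helper names. Real structured output spaces are far too large to enumerate, which is why efficient sampling from $q$ matters.

```python
import itertools
import math


def hamming_reward(y, y_star):
    """Assumed reward: negative number of positions where y differs from y*."""
    return -sum(a != b for a, b in zip(y, y_star))


def payoff_distribution(y_star, vocab, tau):
    """q(y | y*; tau) proportional to exp(reward(y, y*) / tau), by enumeration."""
    outputs = list(itertools.product(vocab, repeat=len(y_star)))
    weights = [math.exp(hamming_reward(y, y_star) / tau) for y in outputs]
    total = sum(weights)
    return {y: w / total for y, w in zip(outputs, weights)}


def rml_loss(q, log_p):
    """-sum_y q(y | y*) log p_theta(y | x) for a single (x, y*) pair."""
    return -sum(q[y] * log_p[y] for y in q)


vocab = ("a", "b", "c")
y_star = ("a", "b")

# Moderate temperature: partially correct outputs receive some mass.
q_smooth = payoff_distribution(y_star, vocab, tau=0.5)

# Very low temperature: q approaches a point mass on y*, i.e. plain ML.
q_sharp = payoff_distribution(y_star, vocab, tau=0.01)

# A stand-in "model" that is uniform over all 9 outputs.
uniform_log_p = {y: math.log(1.0 / len(q_smooth)) for y in q_smooth}
loss_uniform = rml_loss(q_smooth, uniform_log_p)
```

Lowering $\tau$ moves the targets back toward the one-hot ground truth, so ordinary maximum likelihood is recovered as a limiting case; raising it spreads credit over outputs that are close to $y^*$ under the chosen reward.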
[link]
Spatial Pyramid Pooling (SPP) is a technique which allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures do. (However, there is a lower bound for the size of the input image.)

## Idea

* Convolutional layers operate on any size, but fully connected layers need fixed-size inputs
* Solution:
    * Add a new SPP layer on top of the last convolutional layer, before the fully connected layer
    * Use an approach similar to bag of words (BoW), but maintain the spatial information. The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurrences is kept.
* The SPP layer operates on each feature map independently.
* The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer got as input and $M$ is the number of bins.

Example: We could use spatial pyramid pooling with 21 bins:

* 1 bin which is the max of the complete feature map
* 4 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.
* 16 bins which divide the image into 16 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.

## Evaluation

* Pascal VOC 2007, Caltech101: state-of-the-art, without fine-tuning
* ImageNet 2012: Boosts accuracy for various CNN architectures
* ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2

## Code

The paper claims that the code is [here](http://research.microsoft.com/enus/um/people/kahe/), but this no longer seems to be the case. People have tried to implement it with TensorFlow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but so far no public working implementation seems to be available.
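For intuition only, the 21-bin example can be sketched in a few lines of plain Python. This is not the authors' code and not a trainable layer: feature maps are nested lists, the bin boundaries use a floor/ceil split that is an assumption of this sketch (so neighbouring bins may overlap by a row or column when the size is not divisible), and `spp_single_map` and `spp` are hypothetical names.

```python
import math


def spp_single_map(fmap, levels=(1, 2, 4)):
    """Max-pool one H x W feature map into sum(n * n for n in levels) bins."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i at this pyramid level (never empty).
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def spp(feature_maps, levels=(1, 2, 4)):
    """Apply the pyramid to each feature map independently and concatenate."""
    out = []
    for fmap in feature_maps:
        out.extend(spp_single_map(fmap, levels))
    return out


# A 4x4 map with entries 0..15, plus maps of two other sizes.
small = [[r * 4 + c for c in range(4)] for r in range(4)]
wide = [[(r * 7 + c) % 11 for c in range(7)] for r in range(5)]
large = [[(r + c) % 13 for c in range(9)] for r in range(8)]

pooled_small = spp_single_map(small)
pooled_all = spp([small, wide, large])
```

Whatever the spatial size of the input maps, each contributes exactly 21 values, so three maps give a vector of length $k \cdot M = 3 \cdot 21 = 63$ that a fully connected layer can consume.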
## Related papers

* [Atrous Convolution](https://arxiv.org/abs/1606.00915)

[link]
Batch Normalization (Ioffe et al., 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift") is one of the remarkable ideas of the deep learning era, alongside the likes of Dropout and Residual Connections. Nonetheless, the last few years have revealed a few shortcomings of the idea, which Ioffe tries to address two years later with a concept he calls Batch Renormalization.

## Issues with Batch Normalization

* Different parameters are used to compute the normalized output during training and inference.
* Batch Norm works poorly with small minibatches.
* Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario with a minibatch of size 32, we may randomly select 16 labels and then choose 2 examples for each of these labels. The examples interact at every layer, which may cause the model to overfit to the specific distribution of minibatches and suffer when used on individual examples.

The problem with simply using the moving averages during training is that gradient optimization and the normalization then counteract each other, which leads to the model blowing up.

## Idea of Batch Renormalization

We know that

$$\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d, \quad \text{where } r = \frac{\sigma_B}{\sigma}, \; d = \frac{\mu_B - \mu}{\sigma}$$

So the batch renormalization algorithm is defined as follows:

![Batch Renorm Algo](https://fractalanalyticmy.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg)

Ioffe further notes:

> In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.

In the experiments with Batch Renorm, the author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps, and $d_{max}$ = 5 at 25k steps. A training step means one update to the model.
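The role of $r$, $d$ and their bounds can be seen in a forward-only sketch on a batch of scalars. This is a simplified illustration under stated assumptions, not the paper's full algorithm: the learned scale and shift, the moving-average updates of $\mu$ and $\sigma$, and backpropagation (where $r$ and $d$ are treated as constants) are all left out, and `batch_renorm_forward` is a hypothetical name.

```python
import math


def clip(value, low, high):
    return max(low, min(high, value))


def batch_renorm_forward(batch, mu, sigma, r_max, d_max, eps=1e-5):
    """Normalize a batch of scalars with the clipped correction r, d.

    mu, sigma: moving-average statistics (the ones used at inference).
    r_max, d_max: bounds on the correction; r_max=1, d_max=0 gives batchnorm.
    """
    n = len(batch)
    mu_b = sum(batch) / n
    sigma_b = math.sqrt(sum((x - mu_b) ** 2 for x in batch) / n + eps)
    r = clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = clip((mu_b - mu) / sigma, -d_max, d_max)
    return [(x - mu_b) / sigma_b * r + d for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]

# Bounds at (1, 0): the correction is switched off, i.e. plain batchnorm.
as_batchnorm = batch_renorm_forward(batch, mu=2.0, sigma=2.0,
                                    r_max=1.0, d_max=0.0, eps=0.0)

# Very loose bounds: output matches normalization with the moving averages.
as_inference = batch_renorm_forward(batch, mu=2.0, sigma=2.0,
                                    r_max=100.0, d_max=100.0, eps=0.0)
```

With the bounds at 1 and 0 the output has zero mean, exactly as in batchnorm; with loose bounds it equals $(x_i - \mu)/\sigma$, so the training-time activations match what the network will see at inference. Relaxing the bounds gradually moves between these two regimes.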
