ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by unmesh 9 years ago

$\bf Summary:$
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).

The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of the filter is given, i,e., they do not tinker with the number of filters. Decrease in number of parameters will lead to less accuracy. To compensate, the authors propose to downsample late in the network. 

The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction is model size while preserving the accuracy. Their model can be further compressed with existing methods like Deep Compression which are orthogonal to this paper's approach and this can give in total of around 510x reduction while still preserving accuracy of AlexNet.

$\bf Question$: The impact on running times (specially on feed forward phase which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well or at least be *no worse* than the baseline models?

arxiv.org
scholar.google.com

Improved Techniques for Training GANs
Salimans, Tim and Goodfellow, Ian J. and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Gavin Gray 8 years ago

This is heavily derived from the top voted summary above, but I had to review this paper for a lab meeting so I expanded on it a bit. I hope this doesn't offend anyone, but this should be, at worst, redundant to the summary above.

This paper has two parts: five recommended techniques for training GANs and
a thorough evaluation with visual Turing tests and semi-supervised tasks.
That is more concrete than the feature extraction and visualisation in, for
example, Radford's [DCGAN paper][dcgan].

### Feature Matching

Problem: instability from overtraining on the current discriminator.
Intuition is that the discriminator will have learnt the kind of useful
representation we see in deep image models, and there is more information
available by matching those than the single classifier output.

Solution: Match activations at some hidden with an L2 loss. This is
the same as the "content" loss in the [neural style paper][style]:

$$
\newcommand{\aB}{\mathbf{a}}
\newcommand{\bB}{\mathbf{b}}
\newcommand{\cB}{\mathbf{c}}
\newcommand{\dB}{\mathbf{d}}
\newcommand{\eB}{\mathbf{e}}
\newcommand{\fB}{\mathbf{f}}
\newcommand{\gB}{\mathbf{g}}
\newcommand{\hB}{\mathbf{h}}
\newcommand{\iB}{\mathbf{i}}
\newcommand{\jB}{\mathbf{j}}
\newcommand{\kB}{\mathbf{k}}
\newcommand{\lB}{\mathbf{l}}
\newcommand{\mB}{\mathbf{m}}
\newcommand{\nB}{\mathbf{n}}
\newcommand{\oB}{\mathbf{o}}
\newcommand{\pB}{\mathbf{p}}
\newcommand{\qB}{\mathbf{q}}
\newcommand{\rB}{\mathbf{r}}
\newcommand{\sB}{\mathbf{s}}
\newcommand{\tB}{\mathbf{t}}
\newcommand{\uB}{\mathbf{u}}
\newcommand{\vB}{\mathbf{v}}
\newcommand{\wB}{\mathbf{w}}
\newcommand{\xB}{\mathbf{x}}
\newcommand{\yB}{\mathbf{y}}
\newcommand{\zB}{\mathbf{z}}
\newcommand{\Exp}{\mathbb{E}}
|| \Exp_{\xB \sim p_{\text{data}}} \fB (\xB) - \Exp_{\zB \sim p_{\zB}(\zB)} \fB (G(\zB)) ||_2^2
$$

Where $\fB (\xB)$ and $\fB (\zB)$ are the activations in some hidden layer
corresponding to either a real or generated image.

### Minibatch Discrimination

Problem: generators like to collapse to a single mode (ie just generate a
single image), because it's a decent local optimum.

Solution: make sure the discriminator can look at samples in combination,
so it will know if it's getting the same (or similar) images more easily.
Just give the discriminator features that tell it about the distance of
each image to other images in the same batch. The diagram in the paper
describes this best.

They mention this tensor $T$ in the paper, but don't really explain what it
is. In [the code][mbcode], it appears to be basically a weight matrix,
which means that it is also learnt as part of the discriminator.

### Historical Averaging

Problem: no guarantee with gradient descent that a two player game like
this won't go into extended orbits.

Solution: encourage parameters to revert to their historical mean, with an
L2 penalty:

$$
\newcommand{\thetaB}{\boldsymbol{\theta}}
|| \thetaB - \frac{1}{t} \sum_{i=1}^t \thetaB[i] ||^2
$$

Orbits are penalised by always being far from their mean, and this is
supposed to correspond to a "fictitious play" algorithm. I'm not sure if
that's true, but maybe?

### One-sided label smoothing

Problem: vulnerability of discriminator to adversarial examples? (Not
explicitly stated).

Solution: replace positive (ie probability that a sample is real?) labels
with a target _smaller than 1_.

### Virtual Batch Normalisation

Problem: batch normalisation is highly variable, as it is based on
statistics of the current minibatch (enough so that you can sometimes avoid
using dropout if you're using batchnorm).

Solution: for every minibatch, use the statistics gathered from a
_reference minibatch_ for your batch normalisation. For every minibatch,
you'll have to first propagate through the reference minibatch with your
current parameter settings, but then you can use the statistics you gather
by doing this for the minibatch you're actually going to use for training.

_Interesting sidenote_: in the [code][impcode], they are actually using
[weight normalization][weightnorm] instead of batchnorm (not in all cases).
Probably because both papers have Tim Salimans as first author.

### Assessing Image Quality

Problem: noisy labelling from mechanical turk.

Solution: aim for low entropy conditional categorical distribution when
labelling samples with Google's inception model. The inception model gives
you $p(y|\xB)$, so you want to maximise:

$$
\Exp_{\xB} KL (p(y|\xB)||p(y))
$$

Then they exponentiate the resulting value for no real reason, just to make
values easier to compare. Since they say this matches human judgement in
their experiments, this means we can all start using this measure and just
cite this paper!

### Semi-supervised Learning

Problem: standard semi-supervised, we have some data that is labelled and
some that isn't, how to learn a conditional model that will give us
$p(y|\xB)$.

Solution: make "generated" a class in your classification problem. Now you
can put generated samples into your dataset, but even better you can
produce a loss on unlabeled samples that you just _don't want them to be
labeled as "generated"_. So we end up with the following two losses for
supervised and unsupervised data:

$$
L_{\text{supervised}} = - \Exp_{\xB,y \sim p_{\text{data}} (\xB, y)} \log p_{\text{model}} (y | \xB, y < K + 1)
$$

$$
L_{\text{unsupervised}} = - \{ \Exp_{\xB \sim p_{\text{data}} (\xB)} \log [
1- p_{\text{model}} (y = K+1 | \xB) ] + \Exp_{\xB \sim G}\log [
p_{\text{model}} (y=K+1 | \xB)] \}
$$

With this method, and using feature matching but _not minibatch
discrimination_, they show SOTA results for semi-supervised learning on
MNIST, SVHN and CIFAR-10.

[mbcode]: https://github.com/openai/improved-gan/blob/master/mnist_svhn_cifar10/nn.py#L132-L170
[impcode]: https://github.com/openai/improved-gan/blob/master/mnist_svhn_cifar10/nn.py#L45-L91
[weightnorm]: https://arxiv.org/abs/1602.07868
[short]: http://www.shortscience.org/paper?bibtexKey=journals/corr/SalimansGZCRC16#udibr
[improved]: https://arxiv.org/abs/1606.03498
[dcgan]: https://arxiv.org/abs/1511.06434
[style]: https://arxiv.org/abs/1508.06576

arxiv.org
arxiv-vanity.com
scholar.google.com

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.AI, stat.ML
more

[link] Summary by Martin Thoma 8 years ago

This paper describes how to find local interpretable model-agnostic explanations (LIME) why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to it. The way of sampling and the kind of interpretable model depends on the problem domain.

For computer vision / image classification, the image $x$ is divided into superpixels. Single super-pixels are made black, the new image $x'$ is evaluated $p' = m_B(x')$. This is done multiple times. 

The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7-JtFMLo4) by Marco Tulio Ribeiro.

A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma).

## Follow-up Paper

* June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386)
* November 2016:
  * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817)
  * [An unexpected unity among methods for interpreting
model predictions](https://arxiv.org/abs/1611.07478)

papers.nips.cc
scholar.google.com

Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation.
Matthias Hein and Maksym Andriushchenko
Neural Information Processing Systems Conference - 2017 via Local dblp
Keywords:

[link] Summary by David Stutz 7 years ago

Hein and Andriushchenko give a intuitive bound on the robustness of neural networks based on the local Lipschitz constant. With robustness, the authors refer a small $\epsilon$-ball around each sample; this ball is supposed to describe the region where the neural network predicts a constant class. This means that adversarial examples have to compute changes large enough to leave these robust areas. Larger $\epsilon$-balls imply higher robustness to adversarial examples.

When considering a single example $x$, and a classifier $f = (f_1, \ldots, f_K)^T$ (i.e. in a multi-class setting), the bound can be stated as follows. For $q$ and $p$ such that $\frac{1}{q} + \frac{1}{p} = 1$ and $c$ being the class predicted for $x$, the it holds

$x = \arg\max_j f_j(x + \delta)$

for all $\delta$ with

$\|\delta\|_p \leq \max_{R > 0}\min \left\{\min_{j \neq c} \frac{f_c(x) – f_j(x)}{\max_{y \in B_p(x, R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q}, R\right\}$.

Here, $B_p(x, R)$ describes the $R$-ball around $x$ measured using the $p$-norm. Based on the local Lipschitz constant (in the denominator), the bound essentially measures how far we can deviate from the sample $x$ (measured in the $p$-norm) until $f_j(x) > f_c(x)$ for some $j \neq c$. The higher the local Lipschitz constant, the smaller deviations are allowed, i.e. adversarial examples are easier to find. Note that the bound also depends on the confidence, i.e. the edge  $f_c(x)$ has in comparison to all other $f_j(x)$.

In the remaining paper, the authors also provide bounds for simple classifiers including linear classifiers, kernel methods and two-layer perceptrons (i.e. one hidden layer). For the latter, they also propose a new type of regularization called cross-Lipschitz regularization:

$P(f) = \frac{1}{nK^2} \sum_{i = 1}^n \sum_{l,m = 1}^K \|\nabla f_l(x_i) - \nabla f_m(x_i)\|_2^2$.

This regularization term is intended to reduce the Lipschitz constant locally around training examples. They show experimental results using this regularization on MNIST and CIFAR, see the paper for details.

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

arxiv.org
arxiv-vanity.com
scholar.google.com

Variational Inference with Normalizing Flows
Danilo Jimenez Rezende and Shakir Mohamed
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.AI, cs.LG, stat.CO, stat.ME
more

[link] Summary by CodyWild 7 years ago

This paper argues for the use of normalizing flows - a way of building up new probability distributions by applying multiple sets of invertible transformations to existing distributions - as a way of building more flexible variational inference models. 

The central premise of a variational autoencoder is that of learning an approximation to the posterior distribution of latent variables - p(z|x) - and parameterizing that distribution according to values produced by a neural network. In typical practice, this has meant that VAEs are limited in terms of the complexity of latent variable distributions they can encode, since using an analytically specified distribution tends to limit you to simpler distributional shapes - Gaussians, uniform, and the like. Normalizing flows are here proposed as a way to allow for the model to learn more complex forms of posterior distribution. 

Normalizing flows work off of a fairly simple intuition: if you take samples from a distribution p(x), and then apply a function f(x) to each x in that sample, you can calculate the expected value of your new distribution f(x) by calculating the expectation of f(x) under the old distribution p(x). That is to say: 
https://i.imgur.com/NStm7zN.png
This mathematical transformation has a pretty delightful name - The Law of the Unconscious Statistician - that came from the fact that so many statisticians just treated this identity as a definitional fact, rather than something actually in need of proving (I very much fall into this bucket as well). The implication of this is that if you apply many transformations in sequence to the draws from some simple distribution, you can work with that distribution without explicitly knowing its analytical formulation, just by being able to evaluate - and, importantly - invert the function. The ability to invert the function is key, because of the way you calculate the derivative: by taking the inverse of the determinant of the derivative of your function f(z) with respect to z. (Note here that q(z) is the original distribution you sampled under, and q’(z) is the implicit density you’re trying to estimate, after your function has been applied). 

https://i.imgur.com/8LmA0rc.png

Combining these ideas together: a variational flow autoencoder works by having an encoder network define the parameters of a simple distribution (Gaussian or Uniform), and then running the samples from that distribution through a series of k transformation layers. This final transformed density over z is then given to the decoder to work with. Some important limitations are in place here, the most salient of which is that in order to calculate derivatives, you have to be able to calculate the determinant of the derivative of a given transformation. Due to this constraint, the paper only tests a few transformations where this is easy to calculate analytically - the planar transformation and radial transformation. If you think about transformations of density functions as fundamentally stretching or compressing regions of density, the planar transformation works by stretching along an axis perpendicular to some parametrically defined plane, and the radial transformation works by stretching outward in a radial way around some parametrically defined point. Even though these transformations are individually fairly simple, when combined, they can give you a lot more flexibility in distributional space than a simple Gaussian or Uniform could. 

https://i.imgur.com/Xf8HgHl.png