SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks on ShortScience.org

SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks
Armen Aghajanyan
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

Summary by felipe 7 years ago

This paper introduces a new regularization technique that aims at reducing over-fitting without reducing the capacity of a model. It draws on the claim that models start to over-fit data when co-label similarities start to disappear, e.g. when the model output does not show that dogs of similar breeds like German shepherd and Belgian shepherd are similar anymore.

The idea is that models in an early training phase *do* show these similarities. To keep this information in the model, the target labels $Y^t_c$ for training step $t$ are formed by mixing the ground-truth labels with an exponential moving average $\hat{Y}^t$ of the network's outputs from previous training steps:

$$
\begin{aligned}
\hat{Y}^t &= \beta \hat{Y}^{t-1} + (1-\beta)F(X, W),\\
Y_c^t &= \gamma\hat{Y}^t + (1-\gamma)Y,
\end{aligned}
$$

where $F(X,W)$ is the current network's output and $Y$ are the ground-truth labels. This way, the network should remember which classes are similar to each other. The paper shows that training with the proposed regularization scheme preserves co-label similarities about as well as dropout does, whereas an over-fitted model loses them. This supports the intuition on which the method is based.
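The two update equations can be sketched in a few lines of NumPy. This is only an illustration of the formulas as written in this summary, not the paper's implementation: the function name `soft_target_update`, the example values, and the choice of a previous running average are all assumptions made for the example.

```python
import numpy as np

def soft_target_update(y_hat_prev, preds, y_true, beta, gamma):
    """Apply the two update equations once (hypothetical helper).

    y_hat_prev: running average of past predictions, shape (n, classes)
    preds:      current network output F(X, W), shape (n, classes)
    y_true:     one-hot ground-truth labels Y, shape (n, classes)
    """
    # Exponential moving average of the network's predictions.
    y_hat = beta * y_hat_prev + (1.0 - beta) * preds
    # Mix the averaged predictions with the hard labels.
    y_c = gamma * y_hat + (1.0 - gamma) * y_true
    return y_hat, y_c

# Made-up values for a single sample with three classes.
y_true = np.array([[1.0, 0.0, 0.0]])
y_hat_prev = np.array([[0.6, 0.3, 0.1]])
preds = np.array([[0.8, 0.2, 0.0]])

y_hat, y_c = soft_target_update(y_hat_prev, preds, y_true, beta=0.5, gamma=0.5)
print(y_hat)  # running average of predictions
print(y_c)    # soft targets used for training
```

Because both steps are convex combinations of probability vectors, the resulting soft target still sums to one, and the non-zero mass on the wrong classes is what carries the co-label similarity.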

The method introduces several new hyper-parameters:
 - $\beta$, defining the exponential decay parameter for averaging old predictions
 - $\gamma$, defining the weight of soft targets to ground truth targets
 - $n_b$, the number of 'burn-in' epochs, in which the network is trained with hard targets only
 - $n_t$, the number of epochs between soft-target updates
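One possible reading of how $n_b$ and $n_t$ interact is sketched below. The summary does not specify the exact schedule, so the 0-indexed epochs, the first soft-target update happening immediately after burn-in, and the helper name `soft_target_schedule` are all assumptions made for illustration.

```python
def soft_target_schedule(n_epochs, n_b, n_t):
    """Label each epoch (0-indexed) with the target type it trains on.

    Hypothetical helper; assumes the first update follows burn-in directly.
    """
    plan = []
    for epoch in range(n_epochs):
        if epoch < n_b:
            plan.append("hard")    # burn-in: ground-truth targets only
        elif (epoch - n_b) % n_t == 0:
            plan.append("update")  # refresh running average and soft targets
        else:
            plan.append("reuse")   # keep training on the last soft targets
    return plan

plan = soft_target_schedule(n_epochs=8, n_b=2, n_t=3)
print(plan)
```

With these example values, the first two epochs use hard targets, and the soft targets are refreshed at epochs 2 and 5 and reused in between.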

Results on MNIST, CIFAR-10 and SVHN are encouraging, as networks with soft-target regularization achieve lower losses on almost all configurations. However, as of today, the paper does not show how this translates to classification accuracy. Also, it seems that the results are from one training run only, so it is difficult to assess if this improvement is systematic.

Can you expand on "target labels are augmented with an exponential mean of output labels"? Why is the moving average effective? What impact does it have on training?

Thanks for your comment. I have now detailed how the training labels are changed. However, the paper does not explicitly show why the moving average is effective. I also could not find its effect on the training process itself, only on the training result (lower test loss).

Thanks! That is really clear now. So I think SoftTarget is making it harder for a weight update to make drastic changes to the output. So only persistent changes to the output are adopted. The output changing slowly will cause the loss and then gradient to slowly shrink and grow. This feels like a momentum that takes into account the individual outputs of the loss.
