Littwin and Wolf propose an activation variance regularizer that is shown to have a similar, and in some cases better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the key observation is that the variance of these sample variances is low if the activation values come from a distribution with few modes. Thus, the regularizer is intended to encourage activation distributions with only few modes. This is achieved using the regularizer
$\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$
where $\sigma_s^2$ is the sample variance of the activation values, in practice computed on the mini-batch used for training, and $\sigma^2$ is their true variance. Since the true variance is unknown, the regularizer is replaced in practice by
$(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$
which can be estimated on two different mini-batches, $s_1$ and $s_2$, during training; here $\beta$ is a learnable parameter that mainly handles the case where the variance is close to zero. In the paper, the authors provide theoretical bounds and also draw a connection to batch normalization, discussing in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on CIFAR and Tiny ImageNet.
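To make the practical form concrete, the following is a minimal PyTorch sketch of the sampled regularizer for a single unit; the function name, the per-unit treatment of activations, and the initialization of $\beta$ are assumptions for illustration, not taken from the authors' implementation.

```python
import torch

def variance_regularizer(a1: torch.Tensor, a2: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Sampled regularizer (1 - var(a1) / (var(a2) + beta))^2.

    a1, a2: activations of the same unit on two different mini-batches
    (1D tensors); beta: learnable scalar that stabilizes the ratio when
    the variance is close to zero.
    """
    var1 = a1.var()  # sample variance on mini-batch s_1
    var2 = a2.var()  # sample variance on mini-batch s_2
    return (1.0 - var1 / (var2 + beta)) ** 2

# Hypothetical usage: beta is learned alongside the network parameters.
beta = torch.nn.Parameter(torch.tensor(1.0))
a1 = torch.randn(64)  # activations of one unit on mini-batch s_1
a2 = torch.randn(64)  # activations of one unit on mini-batch s_2
loss = variance_regularizer(a1, a2, beta)
loss.backward()
```

In a full network, such a term would presumably be averaged over units and added to the task loss with a weighting factor, but those details are left to the paper.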
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).