First published: 2016/02/08

**Abstract:** Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification of a variational autoencoder, and inversion of deep convolutional networks. In all cases, the generated images look sharp and resemble natural images.
This paper proposes a class of loss functions for image generation based on distances in feature space rather than in image space:
$$\mathcal{L} = \lambda_{feat}\mathcal{L}_{feat} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{img}\mathcal{L}_{img}$$
### Key Points
- Using only an L2 loss in image space yields over-smoothed results, because it averages over all plausible locations of fine details.
- L_feat measures distance in a suitable feature space, and therefore preserves the distribution of fine details rather than their exact locations.
- Using only L_feat yields poor results because feature representations are contractive: many non-natural images are also mapped to the same feature vector.
- Introducing a natural image prior via the adversarial term (a GAN discriminator) ensures that generated samples lie on the natural image manifold.
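As a rough illustration of how the three terms combine, here is a minimal NumPy sketch. The feature extractor is a toy random linear map standing in for the pretrained AlexNet features used in the paper, and the λ weights are arbitrary placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained deep feature extractor (the paper uses
# AlexNet conv features): a fixed random linear map followed by a ReLU.
W = rng.standard_normal((64, 3 * 8 * 8))

def phi(img):
    """Map a flattened 3x8x8 image to a 64-d feature vector."""
    return np.maximum(W @ img.ravel(), 0.0)

def deepsim_loss(gen, target, disc_score,
                 lam_feat=1.0, lam_adv=0.1, lam_img=0.01):
    """L = lam_feat*L_feat + lam_adv*L_adv + lam_img*L_img.

    `disc_score` is the discriminator's estimated probability that `gen`
    is real; the adversarial term is the usual -log D(.) generator loss.
    The lambda defaults here are arbitrary, not taken from the paper.
    """
    l_feat = np.sum((phi(gen) - phi(target)) ** 2)  # distance in feature space
    l_img = np.sum((gen - target) ** 2)             # distance in image space
    l_adv = -np.log(disc_score + 1e-8)              # natural-image prior
    return lam_feat * l_feat + lam_adv * l_adv + lam_img * l_img

target = rng.random((3, 8, 8))
loss_same = deepsim_loss(target, target, disc_score=0.9)
loss_diff = deepsim_loss(rng.random((3, 8, 8)), target, disc_score=0.9)
```

Even for a perfect reconstruction the loss stays slightly positive, since the adversarial term only vanishes when the discriminator is fully fooled.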
### Model
https://i.imgur.com/qNzMwQ6.png
### Experiments
- Autoencoder training
- Image generation with a variational autoencoder (VAE)
- Inversion of deep network features
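As a toy illustration of the feature-inversion objective, the sketch below recovers an input whose features match a target's by gradient descent on the pixels. This is not the paper's method (the paper trains an up-convolutional generator network with the DeePSiM loss), and the extractor here is just a fixed random linear map, chosen so the objective is a smooth quadratic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "feature extractor": a fixed random linear map from 48-d inputs
# to 32-d features (stand-in for a deep network's feature map).
W = rng.standard_normal((32, 48)) / np.sqrt(48)

def phi(x):
    return W @ x

# Target input and its features; we try to find an x whose features match.
x_target = rng.random(48)
f_target = phi(x_target)

x = np.zeros(48)
err0 = np.sum((phi(x) - f_target) ** 2)  # initial feature-space error

lr = 0.1
for _ in range(2000):
    # Gradient of ||phi(x) - f_target||^2 with respect to x.
    grad = 2.0 * W.T @ (phi(x) - f_target)
    x -= lr * grad

err = np.sum((phi(x) - f_target) ** 2)
```

Because the extractor is contractive (48 → 32 dimensions), many inputs share the same features; this is exactly why the paper adds the adversarial and image terms to pick a natural-looking preimage.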
### Thoughts
I found the experiments section a little hard to follow. However, the proposed loss seems really promising and could be applied to many tasks related to image generation.
### Questions
- Sections 4.2 & 4.3 are hard for me to follow; I need to revisit them more carefully.