This paper presents a variety of issues related to the evaluation of image generative models. Specifically, they provide evidence that evaluations of generative models based on the popular Parzen windows estimator or based on a visual fidelity (qualitative) measure both present serious flaws.
The Parzen windows approach to generative modeling evaluation works by taking a finite set of samples generated from a given model and then using those as the centroids of a Parzen windows Gaussian mixture. The constructed Parzen windows mixture is then used to compute a log-likelihood score on a set of test examples.
Some of the key observations made in this paper are:
1. A simple, k-means based approach can obtain better Parzen windows performance than using the original training samples for a given dataset, even though these are samples from the true distribution!
2. Even for the fairly low dimensional space of 6x6 image patches, a Parzen windows estimator would require an extremely large number of samples to come close to the true log-likelihood performance of a model.
3. Visual fidelity is a bad predictor of true log-likelihood performance, as it is possible to
Obtain great visual fidelity and arbitrarily low log-likelihood, with a Parzen windows model made of Gaussians with very small variance.
Obtain bad visual fidelity and high log-likelihood by taking a model with high log-likelihood and mixing it with a white noise model and putting as much as 99% of the mixing probability on the white noise model (i.e. which would produce bad samples 99% of the time).
4. Measuring overfitting of a model by taking samples from the model and making sure their training set nearest neighbors are different is ineffective, since it is actually trivial to generate samples that are each visually almost identical to a training example, but that yet each have large euclidean distance with their corresponding (visually similar) training example.