The paper proposes a new image representation for recognition based on stacking two layers of Fisher vector encoders, with the first layer capturing semi-local information and the second performing sum-pooling aggregation over the entire image. The approach is inspired by the recent success of deep convolutional networks (CNNs). The key difference is that the architecture proposed in this paper is predominantly hand-designed, with relatively few learned parameters compared to CNNs. This is both the strength and the weakness of the approach: it leads to much faster training, but also to slightly lower accuracy than fully learned deep networks.
This paper uses Fisher vectors as inner building blocks in a recognition architecture. The basic Fisher vector module had previously demonstrated superior performance in recognition applications. Here, it is augmented with a discriminative linear projection for dimensionality reduction and with multiscale local pooling, to make it suitable for stacking. Inputs of all layers are jointly used for classification.
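To make the stacking described above concrete, here is a minimal NumPy sketch of how a two-layer Fisher vector pipeline might be wired together. It is an illustration under stated assumptions, not the authors' implementation: the names `fisher_vector`, `random_gmm`, and `two_layer_descriptor` are hypothetical, the GMM parameters are random rather than fitted by EM, a random matrix stands in for the learned discriminative projection, and local pooling uses a single window size rather than multiple scales.

```python
import numpy as np

def fisher_vector(x, weights, means, sigmas):
    """Fisher vector of descriptors x (N, D) under a diagonal-covariance GMM.

    weights: (K,), means: (K, D), sigmas: (K, D) standard deviations.
    Returns a (2*K*D,) vector: gradients w.r.t. means and variances,
    followed by signed square-root and L2 normalisation.
    """
    n = x.shape[0]
    diff = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]  # (N, K, D)
    # Log-density per component, up to a constant shared by all components.
    log_p = np.log(weights) - np.log(sigmas).sum(axis=1) - 0.5 * (diff ** 2).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)  # soft assignments, (N, K)
    g_mu = (post[:, :, None] * diff).sum(axis=0) / (n * np.sqrt(weights)[:, None])
    g_sigma = (post[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (
        n * np.sqrt(2 * weights)[:, None]
    )
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

def random_gmm(rng, k, d):
    """Hypothetical stand-in for a GMM fitted by EM: uniform weights, unit variances."""
    return np.full(k, 1.0 / k), rng.normal(size=(k, d)), np.ones((k, d))

def two_layer_descriptor(feats, gmm1, proj, gmm2, window=4, stride=2):
    """feats: (H, W, D) dense local features. Returns the global second-layer FV."""
    h, w, d = feats.shape
    pooled = []
    for i in range(0, h - window + 1, stride):
        for j in range(0, w - window + 1, stride):
            patch = feats[i:i + window, j:j + window].reshape(-1, d)
            # Layer 1: semi-local FV over the window, then linear projection
            # to a low dimension (assumed random here, learned in the paper).
            pooled.append(proj @ fisher_vector(patch, *gmm1))
    # Layer 2: a single FV aggregating all projected semi-local features.
    return fisher_vector(np.stack(pooled), *gmm2)

rng = np.random.default_rng(0)
D, K1, P, K2 = 8, 4, 6, 3  # feature dim, layer-1 components, projected dim, layer-2 components
feats = rng.normal(size=(10, 10, D))
gmm1 = random_gmm(rng, K1, D)
proj = rng.normal(size=(P, 2 * K1 * D))
gmm2 = random_gmm(rng, K2, P)
desc = two_layer_descriptor(feats, gmm1, proj, gmm2)
print(desc.shape)  # (36,) = 2 * K2 * P
```

The projection step is what keeps stacking tractable in this sketch: the layer-1 Fisher vector has 2·K1·D dimensions, and without reducing it to P dimensions the layer-2 GMM would have to model a very high-dimensional space.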