This paper proposes to learn image features with neural networks that predict the relative motion of the camera between two successive images. The main motivation for this approach is that such data would be very cheap to collect, as it requires no human labelling and relies only on "egomotion" information, which is readily available. More concretely, the network must predict the camera's rotation or translation along the X/Y/Z axes. This is converted into a classification problem by binning each movement into a fixed number of ranges of movement magnitude.

The neural network architecture consists of a siamese-style CNN (SCNN). First, two Base-CNNs (BCNNs) with tied weights process the input image pair (one image per BCNN) to produce features for each image. These features are then concatenated and fed to a Top-CNN (TCNN), which predicts the relative transformation relating the two images. The output layer thus contains groups of softmax units, one group for each dimension of variation of the transformation (e.g. 3 for X/Y/Z rotation).
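To make the setup concrete, below is a minimal PyTorch-style sketch of such a siamese architecture with per-dimension softmax heads; the layer sizes, the number of bins `n_bins`, the number of motion dimensions `n_dims`, and the class name `EgomotionSCNN` are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EgomotionSCNN(nn.Module):
    """Sketch of a siamese CNN: a shared Base-CNN (BCNN) encodes each image of
    the pair; the concatenated features pass through a Top-CNN (TCNN) that ends
    in one softmax group (classification head) per motion dimension."""

    def __init__(self, n_dims=3, n_bins=20, feat_dim=256):
        super().__init__()
        # Shared BCNN; weight tying is obtained by reusing the same module.
        self.bcnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # TCNN operating on the concatenated pair features.
        self.tcnn = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU())
        # One linear head (softmax group) per dimension of the transformation.
        self.heads = nn.ModuleList(nn.Linear(256, n_bins) for _ in range(n_dims))

    def forward(self, img1, img2):
        f = torch.cat([self.bcnn(img1), self.bcnn(img2)], dim=1)
        h = self.tcnn(f)
        return [head(h) for head in self.heads]  # one set of logits per dimension

# Training step sketch: each binned motion dimension contributes its own
# cross-entropy term, and the per-dimension losses are summed.
model = EgomotionSCNN()
img1, img2 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
targets = torch.randint(0, 20, (8, 3))  # bin index for each motion dimension
logits = model(img1, img2)
loss = sum(F.cross_entropy(l, targets[:, i]) for i, l in enumerate(logits))
```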
The experiments show that pretraining on this egomotion task is competitive with pretraining a CNN on the same amount of ImageNet classification data.