[link]
Summary by Anonymous 6 years ago
**Summary**:
A CNN is employed to estimate optical flow. The task is defined as a supervised learning problem. 2 architectures are proposed: a generic one (FlowNetSimple); and another including a correlation layer for feature vectors at different image locations (FlowNetCorr). Networks consist of contracting and expanding parts and are trained as a whole using back-propagation. The correlation layer in FlowNetCorr finds correspondences between feature representations of 2 images instead of following the standard matching approach of extracting features from patches of both images and then comparing them.
https://i.imgur.com/iUe8ir3.png
**Approach**:
1. *Contracting part*: Their first choice is to stack both input images together and feed them through a rather generic network, allowing the network to decide itself how to process the image pair to extract the motion information. This is called 'FlowNetSimple' and consists only of convolutional layers.
The second approach 'FlowNetCorr' is to create two separate, identical processing streams for the two images and to combine them at a later stage. The two architectures are illustrated above. The 'correlation layer' performs multiplicative patch comparisons between two feature maps. The correlation of two patches centered at $x_1$ in the first map and $x_2$ in the second map is then defined as:
$$c(x_1,x_2) =\sum_{o\in[-k,k]*[-k,k]} \langle f_{1}(x_{1}+o),f_{2}(x_{2},o) \rangle $$
for a square patch of size $K = 2k+1$.
2. *Expanding part*: It consists of the upconvolutional layers - combination of unpooling and convolution. 'Upconvolution' is applied to feature maps and concatenated with corresponding feature maps from the 'contractive' part of the network.
**Experiments**:
A new dataset called 'Flying Chairs' is created with $22,872$ image pairs and flow fields. It is created by applying affine transformations to images collected from Flickr and a publicly available set of renderings of 3D chair models [1]. Results are reported on Sintel, KITTI, Middlebury datasets, as well as on their synthetic Flying Chairs dataset. The proposed method is compared with different methods: EpicFlow [2], DeepFlow [3], EDPM [4], and LDOF [5].
The authors inferred that even though the number of parameters of the two networks (FlowNetC, FlowNetCorr) is virtually the same, the FlowNetC slightly more overfits to the training data.
The architecture has nine convolutional layers with stride of $2$ in six of them and a $ReLU$ nonlinearity after each layer. As training loss, endpoint error (EPE) is used which is the standard error measure for optical flow estimation. Below figure shows examples of optical flow prediction on the Sintel dataset. Endpoint error is also shown.
https://i.imgur.com/xIRpUZQ.png
**Scope for Improvement**: It would be interesting to see the performance of the network on more realistic data.
**References**:
[1] Aubry, Mathieu, et al. "Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
[2] Revaud, Jerome, et al. "Epicflow: Edge-preserving interpolation of correspondences for optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[3] Weinzaepfel, Philippe, et al. "DeepFlow: Large displacement optical flow with deep matching." Proceedings of the IEEE International Conference on Computer Vision. 2013.
[4] Bao, Linchao, Qingxiong Yang, and Hailin Jin. "Fast edge-preserving patchmatch for large displacement optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[5] Brox, Thomas, and Jitendra Malik. "Large displacement optical flow: descriptor matching in variational motion estimation." IEEE transactions on pattern analysis and machine intelligence 33.3 (2011): 500-513.
more
less