This paper proposes a model for solving discriminative tasks with video inputs. The
model consists of two convolutional nets. The input to one net is an appearance
frame. The input to the second net is a stack of densely computed optical flow
features. Each pathway is trained separately to classify its input. The
prediction for a video is obtained by taking a (weighted) average of the
predictions made by each net.