The main contribution of this paper is a recurrent visual attention model (RAM). Convolutional networks (CNs) do a great job in computer vision tasks. Unfortunately, the amount of computation they require grows at least linearly with the size of the image. RAM emerges as an alternative that performs as well as CNs, but where the amount of computation can be controlled independently of the image size.
#### What is RAM?
RAM is a model that describes the sequential decision process of a goal-directed agent interacting with a visual environment. At each step the agent decides where to look within a constrained view of the environment and takes actions that maximize a reward. It uses a recurrent neural network to combine information from past glimpses when deciding its future actions.
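The decision loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the weights are random and untrained, the 3x3 glimpse size, the tanh recurrent update, the Gaussian location noise, and all names (`glimpse`, `run_episode`, `w_g`, `w_h`, `w_loc`, `w_cls`) are hypothetical choices made for the example.

```python
import math
import random


def glimpse(image, loc, size=3):
    """Flattened size x size patch around loc, zero-padded at the borders."""
    half = size // 2
    out = []
    for r in range(loc[0] - half, loc[0] + half + 1):
        for c in range(loc[1] - half, loc[1] + half + 1):
            inside = 0 <= r < len(image) and 0 <= c < len(image[0])
            out.append(image[r][c] if inside else 0.0)
    return out


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, n_out, n_in, scale=0.1):
    """Random (untrained) weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def run_episode(image, label, n_glimpses=4, hidden=8, n_classes=2, seed=0):
    """One episode: look n_glimpses times, then classify and collect a reward."""
    rng = random.Random(seed)
    rows, cols = len(image), len(image[0])
    g_dim = 9 + 2  # 3x3 patch plus the (row, col) where it was taken
    w_g = rand_matrix(rng, hidden, g_dim)
    w_h = rand_matrix(rng, hidden, hidden)
    w_loc = rand_matrix(rng, 2, hidden)
    w_cls = rand_matrix(rng, n_classes, hidden)

    h = [0.0] * hidden
    loc = (rows // 2, cols // 2)  # assumed starting point: image centre
    visited = []
    for _ in range(n_glimpses):
        visited.append(loc)
        x = glimpse(image, loc) + [loc[0] / rows, loc[1] / cols]
        # Recurrent update: combine the new glimpse with the past state.
        pre = [a + b for a, b in zip(matvec(w_g, x), matvec(w_h, h))]
        h = [math.tanh(p) for p in pre]
        # Stochastic "where to look next" action in [-1, 1] coordinates.
        mean = [math.tanh(m) for m in matvec(w_loc, h)]
        sample = [max(-1.0, min(1.0, rng.gauss(m, 0.2))) for m in mean]
        loc = (int((sample[0] + 1) / 2 * (rows - 1)),
               int((sample[1] + 1) / 2 * (cols - 1)))

    # Final action: classify from the last hidden state; reward 1 if correct.
    scores = matvec(w_cls, h)
    prediction = scores.index(max(scores))
    reward = 1.0 if prediction == label else 0.0
    return visited, prediction, reward


image = [[float((r + c) % 2) for c in range(10)] for r in range(10)]
visited, prediction, reward = run_episode(image, label=1)
```

The location is drawn by sampling rather than computed by a smooth function of the image, which is why the reward has to act as the training signal in a setup like this.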
#### What do we gain?
The attention mechanism decides which parts of the image are worth looking at to solve the task, so it can ignore clutter. In addition, the amount of computation can be chosen independently of the image size. The model could also be applied directly to variable-size images and to detecting multiple objects in one image.
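To make the point about computation concrete, here is a minimal sketch of fixed-size glimpse extraction. It assumes a single-resolution square patch with zero padding at the borders; the helper names (`extract_glimpse`, `make_image`, `pixels_read`) are hypothetical and exist only for this example.

```python
def extract_glimpse(image, center, size):
    """Return a size x size patch around center; pixels outside the image are 0."""
    rows, cols = len(image), len(image[0])
    r0 = center[0] - size // 2
    c0 = center[1] - size // 2
    patch = []
    for r in range(r0, r0 + size):
        row = []
        for c in range(c0, c0 + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0.0)
        patch.append(row)
    return patch


def make_image(height, width):
    """Synthetic image whose pixel value encodes its position."""
    return [[float(r * width + c) for c in range(width)] for r in range(height)]


def pixels_read(patch):
    """Number of pixels the model has to process for one glimpse."""
    return sum(len(row) for row in patch)


small = make_image(28, 28)
large = make_image(512, 384)

g_small = extract_glimpse(small, (14, 14), 8)
g_large = extract_glimpse(large, (256, 192), 8)

# Both glimpses contain 8 * 8 = 64 pixels, whatever the image size.
print(pixels_read(g_small), pixels_read(g_large))
```

The per-step cost depends only on the glimpse size and the number of glimpses, so the same code handles the 28x28 and the 512x384 image without change.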
#### What follows?
An extension that may be worth exploring is whether the attention mechanism can be made differentiable. This may already have been done in other papers.
* Can be used for analyzing videos and playing games.
* Useful in cluttered environments.
* The model is non-differentiable.