The main contribution of this paper is introducing a (recurrent) visual attention model (RAM). Convolutional networks (CNs) seem to do a great job in computer vision tasks. Unfortunately, the amount of computation they require grows (at least) linearly in the size of the image. The RAM surges as an alternative that performs as well as CNs, but where the amount of computation can be controlled independently of the image size.
#### What is RAM?
A model that describes a sequential decision process of a goal-directed agent interacting with a visual environment. It involves deciding where to look in a constrained visual environment and taking decisions to maximize a reward. It uses a recurrent neural network to combine information from the past to decide its future actions.
#### What do we gain?
The attention mechanism takes care of deciding the parts of the image that are worth looking to solve the task. Therefore, it will ignore clutter. In addition, the amount of computation can be decided independently of the image sizes. Furthermore, this could also be directly applied to variable size images as well as detecting multiple objects in one image.
#### What follows?
An extension that may be worth exploring is whether the attention mechanism can be made differentiable. This might be already done in other papers.
#### Like:
* Can be used for analyzing videos and playing games.
Useful in cluttered environments.
#### Dislike:
* The model is non-differentiable.
TLDR; The authors train a RNN that takes as input a glimpse (part of the image subsamples to same size) and outputs a new glimpse and action (prediction, agent move) at each step. Thus, the model adaptively selects which part of an image to "attend" to. By defining the number of glimpses and their reoslutions we can control the complexity of the model independently of image size, which is not true for CNNs. The model is not differentiable, but can be trained using Reinforcement Learning techniques. The authors evaluate the model on the MNIST dataset, a cluttered version of MNIST, and a dynamic video game environment.
#### Questions / Notes
- I think the the author's claim taht the model works independently of image size is only partly true, as larger images are likely to require more glimpses or bigger regions.
- Would be nice to see some large-scale benchmarks as MNIST is very simple tasks. However, the authors clearly identify this as future work.
- No mentions about training time. Is it even feasible to train this for large images (which probably require more glimpses)?