The goal of this paper is to locate a specific object in an image. A region proposal algorithm first generates candidate regions that may contain objects, and the aim is to avoid processing all of these candidates. The idea is to use reinforcement learning (RL) to pick, among the candidates near the current point, the one that serves as the basis for computing the coordinates of the next point to examine.
Starting from the center, every candidate window that overlaps a circle of some radius around the current point is evaluated with the RL policy $\pi$. The state given to $\pi$ combines features extracted by a CNN with values that track the progress of the search, such as how many candidates have been evaluated so far. The selected candidate has its features extracted, and these features are then transformed into the coordinates of where to look next. The process then repeats from that next point until a proper classification is made or the algorithm decides to stop.
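To make the control flow concrete, here is a minimal sketch of the loop as I read it, not the authors' implementation. All names (`Box`, `overlaps_circle`, `search`, `max_steps`) are hypothetical, and the callables `policy`, `features`, `next_fixation` and `is_target` are placeholders for the learned policy $\pi$, the CNN feature extractor, the feature-to-coordinate transform and the classifier, none of which are specified in this summary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned candidate window (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps_circle(box, cx, cy, radius):
    """True if the box intersects the disc of `radius` around (cx, cy)."""
    # Closest point of the box to the circle center.
    nx = min(max(cx, box.x0), box.x1)
    ny = min(max(cy, box.y0), box.y1)
    return math.hypot(cx - nx, cy - ny) <= radius


def search(proposals, start, radius, policy, features,
           next_fixation, is_target, max_steps=10):
    """Sequential search over region proposals.

    policy(states) returns an index into the neighbor list, or None to stop.
    Returns (index of the accepted proposal or None, list of visited indices).
    """
    cx, cy = start
    evaluated = set()
    trace = []
    for _ in range(max_steps):
        # Only candidates overlapped by the radius around the current point.
        neighbors = [i for i, b in enumerate(proposals)
                     if i not in evaluated and overlaps_circle(b, cx, cy, radius)]
        if not neighbors:
            break
        # State = candidate features plus search bookkeeping (assumed form).
        states = [(features(proposals[i]), len(evaluated)) for i in neighbors]
        choice = policy(states)
        if choice is None:  # the policy decides to stop
            break
        idx = neighbors[choice]
        evaluated.add(idx)
        trace.append(idx)
        if is_target(proposals[idx]):
            return idx, trace
        # Features of the chosen candidate give the next point to look at.
        cx, cy = next_fixation(features(proposals[idx]), (cx, cy))
    return None, trace


# Toy stand-ins, purely illustrative.
proposals = [Box(40, 40, 60, 60), Box(70, 40, 90, 60), Box(150, 150, 170, 170)]
found, trace = search(
    proposals,
    start=(50, 50),
    radius=15,
    policy=lambda states: 0,                          # always take the first neighbor
    features=lambda b: ((b.x0 + b.x1) / 2, (b.y0 + b.y1) / 2),
    next_fixation=lambda f, cur: (f[0] + 30, f[1]),   # fixed shift to the right
    is_target=lambda b: b == proposals[1],
)
```

In this toy run only two of the three proposals are ever evaluated, which is the saving the method is after. Whether the real state is built per candidate or jointly over all neighbors is not clear from the summary, so the per-candidate tuple above is an assumption.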