The idea in this paper is to develop a version of attention that incorporates similarity between neighboring bins. This is in line with the work of \cite{conf/icml/BeckhamP17}, which presented a different approach to dealing with consistency between predicted classes.
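In this work, the closed-form softmax function is replaced by a small optimization problem over the attention weights $y$, whose objective includes the following regularizer penalizing differences between adjacent entries: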
$$ +\lambda \sum_{i=1}^{d-1} |y_{i+1}-y_i|$$
Because of this, many neighboring probabilities are exactly equal, resulting in attention that appears as contiguous blocks.
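To make the effect of the penalty concrete, here is a minimal sketch that only evaluates the regularizer term on two hand-picked attention vectors. It does not solve the optimization problem from the paper. The function name `total_variation_penalty` and both example vectors are made up for illustration.

```python
def total_variation_penalty(y, lam):
    """Return lam * sum_i |y[i+1] - y[i]| for a 1-D sequence y."""
    return lam * sum(abs(b - a) for a, b in zip(y, y[1:]))


# Two hypothetical attention vectors over d = 6 bins, each summing to 1.
blocky = [0.0, 0.25, 0.25, 0.25, 0.25, 0.0]    # one flat block of equal weights
jagged = [0.05, 0.4, 0.05, 0.3, 0.05, 0.15]    # weights jump between neighbors

penalty_blocky = total_variation_penalty(blocky, lam=1.0)
penalty_jagged = total_variation_penalty(jagged, lam=1.0)

print(penalty_blocky, penalty_jagged)
```

The blocky vector only pays for the two jumps at the edges of its block, while the jagged one pays for every change between neighbors. A larger $\lambda$ therefore pushes the solution toward a few flat segments.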
https://i.imgur.com/oue0x4V.png
Poster:
https://i.imgur.com/gclMjzR.png