The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winner-take-all kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm.
The authors present a fully connected auto-encoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the mini-batch, they rank the activation values, keeping only the top-k% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold).
Their encoder is a deep stack of ReLu and the decoder is shallow and linear (note that usually non-symmetric auto-encoders lead to worse results). They also show how to apply to RBM. The effect of sparsity is very effective and noticeable on the images depicting the filters.
They extend this auto-encoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before.
The proposed approach exploits on a mechanism close to the one of k-sparse autoencoders proposed by Makkhzani et al [14]. The authors extend the idea from [14] to build winner-take-all encoders (and RBMs), that enforce both spatial and lifetime regularization by keeping only a percentage (the biggest) of activations. The lifetime sparsity allows overcoming problems that could arise with k-sparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with larger images than e.g. those of mnist.