ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
arXiv e-Print archive - 2016 via Local Bibsonomy
ENet, a fully convolutional network, is aimed at real-time inference for semantic segmentation. Compared with existing models, ENet is up to 18× faster, requires 75× fewer FLOPs, and has 79× fewer parameters, while providing similar or better accuracy. SegNet is used as the benchmark.
ENet achieves its speed because, instead of following the regular encoder-decoder structure used by most state-of-the-art semantic segmentation nets, it takes a ResNet-style approach. Most nets learn features from a heavily downsampled convolutional map and then upsample the learnt features. ENet instead learns features with ResNet-like bottleneck modules from a convolutional map that is downsampled only three times in the encoder. The encoder is designed to work like the convolutional architectures used for classification tasks. The authors also note that repeated downsampling tends to hurt accuracy by discarding spatial detail, and that it has to be undone by upsampling layers that add memory and computation cost. Hence, by including only two upsampling operations and a single deconvolution in the decoder, ENet aims for faster semantic segmentation.
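The resolution bookkeeping described above can be traced with a few lines of Python. This is only a sketch: it assumes every resolution-changing stage scales height and width by a factor of 2 and that the final deconvolution also doubles the resolution. The stage names are illustrative labels, not identifiers from the authors' code.

```python
# Sketch: spatial resolution through an ENet-like encoder and decoder.
# Assumption: each listed stage changes height and width by a factor of 2.
ENCODER_STAGES = [
    "initial block",
    "downsampling bottleneck 1",
    "downsampling bottleneck 2",
]
DECODER_STAGES = [
    "upsampling bottleneck 1",
    "upsampling bottleneck 2",
    "final deconvolution",
]


def trace_resolution(height, width):
    """Return a list of (stage, height, width) tuples for the whole network."""
    trace = [("input", height, width)]
    for stage in ENCODER_STAGES:
        height, width = height // 2, width // 2
        trace.append((stage, height, width))
    for stage in DECODER_STAGES:
        height, width = height * 2, width * 2
        trace.append((stage, height, width))
    return trace


if __name__ == "__main__":
    for stage, h, w in trace_resolution(512, 512):
        print(f"{stage:28s} {h:4d} x {w:4d}")
```

Under these assumptions a 512 x 512 input is reduced to 64 x 64 at the end of the encoder (an overall factor of 8) and restored to 512 x 512 by the decoder.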
The full architecture of the net is laid out in the paper's architecture table, and the initial block and the bottleneck module are each illustrated in the paper's figures.
To further reduce the number of kernel calls and memory operations, no bias terms are used in any of the projections, since cuDNN uses separate kernels for convolution and bias addition.
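To get a feel for what dropping the projection biases changes, the sketch below counts the parameters of one bottleneck with and without them. The layout (1x1 projection, 3x3 main convolution, 1x1 expansion), the channel reduction factor of 4, and the choice to keep a bias on the 3x3 convolution are assumptions made for illustration. Batch normalization and activation parameters are ignored.

```python
def conv_params(in_ch, out_ch, kernel, bias):
    """Learnable parameters of a 2D convolution with a square kernel."""
    weights = in_ch * out_ch * kernel * kernel
    return weights + (out_ch if bias else 0)


def bottleneck_params(channels, reduction=4, projection_bias=False):
    """Parameter count of a hypothetical bottleneck module.

    Assumed layout: 1x1 projection -> 3x3 convolution -> 1x1 expansion.
    Only the two 1x1 projections are affected by `projection_bias`.
    """
    mid = channels // reduction
    return (
        conv_params(channels, mid, 1, projection_bias)    # 1x1 reduce
        + conv_params(mid, mid, 3, True)                  # 3x3 main conv
        + conv_params(mid, channels, 1, projection_bias)  # 1x1 expand
    )


if __name__ == "__main__":
    without_bias = bottleneck_params(128, projection_bias=False)
    with_bias = bottleneck_params(128, projection_bias=True)
    print(without_bias, with_bias, with_bias - without_bias)
```

For 128 channels the difference is only 160 parameters out of roughly 17,600, so the parameter saving is marginal. The benefit lies in skipping the separate bias-addition kernel for each projection.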
The encoder and decoder are trained separately. The encoder is first trained to categorize downsampled regions of the input image. The decoder is then appended, and the network is trained to perform upsampling and pixel-wise classification.
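Training the encoder on downsampled regions means the ground-truth label map has to be brought down to the encoder's output resolution. The summary does not say how this is done, so the sketch below assumes plain nearest-neighbour subsampling by a factor of 8, matching three halvings of the resolution. The function name is hypothetical.

```python
def downsample_labels(labels, factor=8):
    """Keep every `factor`-th label along both axes.

    `labels` is a list of rows, each row a list of integer class ids.
    This is nearest-neighbour subsampling, an assumption of this sketch.
    """
    return [row[::factor] for row in labels[::factor]]


if __name__ == "__main__":
    # 16 x 16 label map: class 0 on the left half, class 1 on the right half.
    full = [[0 if x < 8 else 1 for x in range(16)] for _ in range(16)]
    print(downsample_labels(full))
```

The encoder would then be trained against these coarse targets, and the full-resolution labels would be used once the decoder is attached.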
Results are reported on the widely used NVIDIA Titan X GPU as well as on the NVIDIA TX1 embedded system module. On the CamVid, Cityscapes, and SUN RGB-D datasets, inference times and model sizes were significantly lower than SegNet's, while the IoU or accuracy (as applicable) was nearly equivalent to SegNet's in most cases.
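For reference, the IoU metric mentioned above can be computed as follows. This is a generic sketch of per-class intersection-over-union on flattened label sequences, not the evaluation script of any of the benchmarks. Classes that appear in neither the prediction nor the ground truth are skipped when averaging, which is one common convention and an assumption here.

```python
def class_iou(pred, target, num_classes):
    """Per-class IoU for two equal-length sequences of class ids.

    Returns None for a class absent from both prediction and target.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(class_iou(pred, target, 2), mean_iou(pred, target, 2))
```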