ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation on ShortScience.org


ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Paszke, Adam and Chaurasia, Abhishek and Kim, Sangpil and Culurciello, Eugenio
arXiv e-Print archive - 2016 via Local Bibsonomy


Summary by Qure.ai 7 years ago

ENet, a fully convolutional network, is aimed at real-time inference for semantic segmentation. Compared with existing models, ENet is up to 18× faster, requires 75× fewer FLOPs, has 79× fewer parameters, and provides similar or better accuracy. SegNet is used as the baseline for benchmarking.
##### Innovation
ENet achieves its speed because, instead of following the regular encoder-decoder structure used by most state-of-the-art semantic segmentation networks, it takes a ResNet-like approach. Most networks learn features from a heavily downsampled feature map and then upsample the learnt features; ENet instead applies the underlying principles of ResNet to a feature map that is downsampled only three times in the encoder. The encoder is designed to have the same functionality as convolutional architectures used for classification tasks. The authors also note that repeated downsampling tends to hurt accuracy, and that the upsampling layers needed to undo it add to the memory cost. Hence, by including only two upsampling operations and a single deconvolution, ENet aims to achieve faster semantic segmentation.
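
The resolution path described above can be traced with a few lines of arithmetic. This is a minimal sketch, assuming a 512×512 input and assuming every downsampling, upsampling, and deconvolution step changes the resolution by a factor of 2; the function name `trace_resolution` is hypothetical and not from the paper.

```python
def trace_resolution(height, width, n_down=3, n_up=3, factor=2):
    """Return the (h, w) feature-map size after each resolution change.

    n_down: downsampling steps in the encoder (three, per the summary).
    n_up:   two upsampling operations plus one deconvolution, each assumed
            to double the resolution (an assumption of this sketch).
    """
    sizes = [(height, width)]
    h, w = height, width
    for _ in range(n_down):
        h, w = h // factor, w // factor
        sizes.append((h, w))
    for _ in range(n_up):
        h, w = h * factor, w * factor
        sizes.append((h, w))
    return sizes


sizes = trace_resolution(512, 512)
for step, (h, w) in enumerate(sizes):
    print(f"step {step}: {h}x{w}")
```

Under these assumptions the encoder works at no less than 1/8 of the input resolution, and the decoder returns to the full input size.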

##### Architecture
The architecture of the network is shown below.
![enet_architecture](https://i.imgur.com/rw1lVKQ.png)

The initial block and the bottleneck module look like this, respectively:
![enet_bottlenecks](https://i.imgur.com/sveifk5.png)

To reduce the number of kernel calls and memory operations, no bias terms were included in any of the projections, since cuDNN uses separate kernels for convolution and bias addition.
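
For a sense of scale, the sketch below counts the parameters in a pair of 1×1 projections with and without bias. The channel widths (128 and 32) and the helper name `projection_params` are illustrative assumptions, not values taken from the summary.

```python
def projection_params(c_in, c_out, bias):
    """Parameter count of a 1x1 convolution mapping c_in to c_out channels."""
    return c_in * c_out + (c_out if bias else 0)


# Hypothetical widths: a 128-channel map reduced to 32 channels and expanded back.
channels, inner = 128, 32

without_bias = (projection_params(channels, inner, bias=False)
                + projection_params(inner, channels, bias=False))
with_bias = (projection_params(channels, inner, bias=True)
             + projection_params(inner, channels, bias=True))

print(without_bias, with_bias, with_bias - without_bias)
```

With these widths the bias terms are a small share of the parameters, which is consistent with the motivation being the separate bias-addition kernel rather than model size.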

##### Training
The encoder and decoder are trained separately. The encoder is first trained to categorize downsampled regions of the input image; the decoder is then appended, and the network is trained to perform upsampling and pixel-wise classification.
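
One way to picture the encoder's training target is a label map shrunk to the encoder's output resolution. The sketch below assumes a downsampling factor of 8 and nearest-neighbour subsampling of the labels; both are assumptions of this example, and `downsample_labels` is a hypothetical helper, not the authors' code.

```python
def downsample_labels(labels, factor=8):
    """Nearest-neighbour subsampling: keep every `factor`-th label along each axis."""
    return [row[::factor] for row in labels[::factor]]


# Toy 16x16 label map made of four 8x8 blocks with class ids 0, 1, 2, 3.
labels = [[(r // 8) * 2 + (c // 8) for c in range(16)] for r in range(16)]

small = downsample_labels(labels)
print(small)  # one label per 8x8 region
```

In the second stage the full-resolution `labels` map would serve as the target instead.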

##### Results
Results are reported on the widely used NVIDIA Titan X GPU as well as on the NVIDIA TX1 embedded system module. On the CamVid, Cityscapes, and SUN RGB-D segmentation datasets, inference times and model sizes were significantly lower than SegNet's, while the IoU or accuracy (as applicable) was nearly equivalent to SegNet's in most cases.
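
For readers unfamiliar with the IoU metric mentioned above, here is a minimal sketch of per-class and mean intersection-over-union on small label maps. The toy prediction and target are made up for illustration, and the benchmark tooling used by the authors may differ in detail (for example, in how ignored pixels are handled).

```python
def class_iou(pred, target, cls):
    """IoU for one class: |pred ∩ target| / |pred ∪ target| over all pixels."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            in_pred, in_target = p == cls, t == cls
            inter += in_pred and in_target
            union += in_pred or in_target
    return inter / union if union else float("nan")


def mean_iou(pred, target, n_classes):
    """Average IoU over classes that appear in pred or target."""
    scores = [class_iou(pred, target, c) for c in range(n_classes)]
    valid = [s for s in scores if s == s]  # NaN != NaN, so absent classes are dropped
    return sum(valid) / len(valid)


pred = [[0, 0],
        [1, 1]]
target = [[0, 1],
          [1, 1]]

miou = mean_iou(pred, target, n_classes=2)
print(round(miou, 4))
```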
