The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation
Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio
arXiv e-Print archive - 2016
Keywords: cs.CV
First published: 2016/11/28
Abstract: State-of-the-art approaches for semantic image segmentation are built on
Convolutional Neural Networks (CNNs). The typical segmentation architecture is
composed of (a) a downsampling path responsible for extracting coarse semantic
features, followed by (b) an upsampling path trained to recover the input image
resolution at the output of the model and, optionally, (c) a post-processing
module (e.g. Conditional Random Fields) to refine the model predictions.
Recently, a new CNN architecture, Densely Connected Convolutional Networks
(DenseNets), has shown excellent results on image classification tasks. The
idea of DenseNets is based on the observation that if each layer is directly
connected to every other layer in a feed-forward fashion then the network will
be more accurate and easier to train.
In this paper, we extend DenseNets to deal with the problem of semantic
segmentation. We achieve state-of-the-art results on urban scene benchmark
datasets such as CamVid and Gatech, without any further post-processing module
nor pretraining. Moreover, due to smart construction of the model, our approach
has much less parameters than currently published best entries for these
datasets.
In this paper, the authors (Jégou et al.) use DenseNets for semantic segmentation. A DenseNet iteratively concatenates input feature maps to output feature maps. The main contribution is a novel upsampling path, since upsampling in the conventional way would have caused a severe memory crunch.
#### Background
Fully convolutional networks for semantic segmentation generally follow a common pattern: a downsampling path that acts as a feature extractor, followed by an upsampling path that recovers the spatial resolution, and hence the location of every feature extracted in the downsampling path.
As opposed to residual networks (where input feature maps are added to the output), in DenseNets the output of each layer is concatenated with its input, which has some interesting implications (illustrated in the sketch below):
- DenseNets are parameter-efficient, since all feature maps are reused
- DenseNets perform a form of deep supervision, thanks to the short paths from the loss to all feature maps in the architecture
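To make the concatenation pattern concrete, here is a minimal PyTorch-style sketch of a dense block, written for this summary. The BN → ReLU → 3×3 conv → dropout composition follows the paper's description, but the class names and exact hyperparameters are assumptions:

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Sequential):
    """BN -> ReLU -> 3x3 conv -> dropout, producing `growth_rate` new feature maps."""

    def __init__(self, in_channels, growth_rate, drop_rate=0.2):
        super().__init__(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
            nn.Dropout2d(drop_rate),
        )


class DenseBlock(nn.Module):
    """Each layer sees the concatenation of the block input and all previous layer outputs."""

    def __init__(self, in_channels, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(n_layers)
        )

    def forward(self, x):
        new_features = []
        for layer in self.layers:
            out = layer(x)
            new_features.append(out)
            x = torch.cat([x, out], dim=1)  # concatenate, don't add (contrast with ResNets)
        # Return only the feature maps produced inside the block; the caller decides
        # whether to also concatenate the block input (downsampling path) or to
        # upsample these new maps alone (upsampling path).
        return torch.cat(new_features, dim=1)
```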
Using DenseNets for segmentation, however, poses a problem for upsampling: if feature maps were concatenated through skip connections in the conventional way, the channel count at full resolution could easily grow beyond 1-1.5K. The authors therefore suggest a novel transition up, in which only the feature maps produced by the last dense block are upsampled, rather than the full concatenation. After upsampling, the output is concatenated with the same-resolution feature maps from the downsampling path through a skip connection, so that the spatial information lost during pooling can be recovered.
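To see why naively upsampling the full concatenation is problematic, here is a small bookkeeping sketch in plain Python, written for this summary (not taken from the paper). It counts the channels entering each upsampling dense block under both strategies, assuming the FC-DenseNet103 hyperparameters reported in the paper: growth rate 16, 48 first-layer filters, 4/5/7/10/12 layers in the downsampling blocks and 15 in the bottleneck.

```python
# Channel-count bookkeeping for the two upsampling strategies (hypothetical helper).
GROWTH = 16
FIRST_CONV = 48
DOWN_LAYERS = [4, 5, 7, 10, 12]
BOTTLENECK_LAYERS = 15
UP_LAYERS = [12, 10, 7, 5, 4]


def channel_counts(carry_full_concat: bool):
    """Return the number of channels entering each upsampling dense block."""
    channels = FIRST_CONV
    skips = []
    # Downsampling path: each block adds GROWTH maps per layer to the concatenation.
    for n_layers in DOWN_LAYERS:
        channels += n_layers * GROWTH
        skips.append(channels)  # stored for the skip connection
        # the transition down keeps the channel count (1x1 conv + pooling)
    # Bottleneck
    new_maps = BOTTLENECK_LAYERS * GROWTH
    full_concat = channels + new_maps
    # Upsampling path
    inputs = []
    for n_layers, skip in zip(UP_LAYERS, reversed(skips)):
        carried = full_concat if carry_full_concat else new_maps
        block_in = carried + skip  # upsampled maps + skip connection
        inputs.append(block_in)
        new_maps = n_layers * GROWTH
        full_concat = block_in + new_maps
    return inputs


print("paper's scheme:", channel_counts(carry_full_concat=False))
print("naive scheme  :", channel_counts(carry_full_concat=True))
```

Under these assumptions, the paper's scheme keeps the channel count entering the upsampling blocks below ~900, whereas carrying the full concatenation upward pushes it past 3000 at the highest (and most memory-hungry) resolutions.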
#### Methodology & Architecture
In the downsampling path, the input of each dense block is concatenated with its output before the transition down, whereas in the upsampling path only the output of the dense block is upsampled (without concatenating it with the block input) and then concatenated with the same-resolution output of the downsampling path.
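Here is a rough PyTorch-style sketch of the two transitions, again written for this summary. The paper describes the transition down as BN → ReLU → 1×1 conv → dropout → 2×2 max pooling and the transition up as a 3×3 transposed convolution with stride 2, but the module names and padding details below are assumptions:

```python
import torch
import torch.nn as nn


class TransitionDown(nn.Sequential):
    """Applied to the concatenation of a dense block's input and output; halves the resolution."""

    def __init__(self, channels, drop_rate=0.2):
        super().__init__(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.Dropout2d(drop_rate),
            nn.MaxPool2d(kernel_size=2),
        )


class TransitionUp(nn.Module):
    """Upsamples only the maps produced by the previous dense block, then
    concatenates them with the skip connection from the downsampling path."""

    def __init__(self, channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                         stride=2, padding=1, output_padding=1)

    def forward(self, block_output, skip):
        up = self.deconv(block_output)       # back to the skip connection's resolution
        return torch.cat([up, skip], dim=1)  # recover detail lost during pooling
```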
Here's the overall architecture:
![](https://i.imgur.com/tqsPj72.png)
Here's what a dense block looks like:
![](https://i.imgur.com/MMqosoj.png)
#### Results
On the CamVid dataset, the 103-convolutional-layer model (FC-DenseNet103) performed better than its shallower counterparts, even though the FC-DenseNets were neither pretrained nor combined with any post-processing such as CRFs or temporal smoothing. Compared to other networks, the FC-DenseNet architectures achieve state-of-the-art results, improving upon models with 10 times more parameters. It is also worth noting that the small FC-DenseNet56 model already outperforms popular architectures with at least 100 times more parameters.