Fully Convolutional Networks for Semantic Segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell
arXiv e-Print archive - 2014
Keywords: cs.CV
First published: 2014/11/14
Abstract: Convolutional networks are powerful visual models that yield hierarchies of
features. We show that convolutional networks by themselves, trained
end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic
segmentation. Our key insight is to build "fully convolutional" networks that
take input of arbitrary size and produce correspondingly-sized output with
efficient inference and learning. We define and detail the space of fully
convolutional networks, explain their application to spatially dense prediction
tasks, and draw connections to prior models. We adapt contemporary
classification networks (AlexNet, the VGG net, and GoogLeNet) into fully
convolutional networks and transfer their learned representations by
fine-tuning to the segmentation task. We then define a novel architecture that
combines semantic information from a deep, coarse layer with appearance
information from a shallow, fine layer to produce accurate and detailed
segmentations. Our fully convolutional network achieves state-of-the-art
segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012),
NYUDv2, and SIFT Flow, while inference takes one third of a second for a
typical image.
## Terms
* Semantic Segmentation: Traditional segmentation divides an image into visually similar patches. Semantic segmentation, on the other hand, divides the image into semantically meaningful regions, which usually means classifying each pixel (e.g. this pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background); see the sketch below.
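A toy illustration of the difference in output shape (the sizes, class names, and label layout below are made up purely for illustration):

```python
import numpy as np

# Image classification: one label for the whole image.
# Semantic segmentation: one label per pixel.
H, W = 4, 6                        # tiny toy image (illustrative size)
classes = ["background", "cat", "dog"]

# A classifier emits a single class index for the whole image ...
image_label = 1                    # "cat"

# ... while a semantic segmentation model emits an H x W map of class indices.
segmentation = np.zeros((H, W), dtype=np.int64)    # everything "background"
segmentation[1:3, 1:4] = 1                          # a blob of "cat" pixels
segmentation[2:4, 4:6] = 2                          # a blob of "dog" pixels
print(segmentation)
```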
## Main ideas
* Complete neural networks that were trained for image classification can be used as a convolution, i.e. slid over larger images. Such networks can be pretrained on ImageNet (e.g. AlexNet, VGG, GoogLeNet).
* Use in-network upsampling (deconvolutional layers) to (1) reduce training and prediction time and (2) improve the consistency of the output; see the sketch after this list. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.)
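A minimal sketch (in PyTorch, not the paper's code) of such an upsampling layer: a transposed convolution ("deconvolution") whose weights are initialized to perform bilinear interpolation, turning a coarse score map into a dense one. The channel count and upsampling factor are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Weights that make a transposed convolution perform bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    weight[range(channels), range(channels)] = filt   # each channel upsampled independently
    return torch.from_numpy(weight)

# Upsample a coarse 21-channel score map (e.g. 20 PASCAL VOC classes + background) by a factor of 4.
factor, channels = 4, 21
upsample = nn.ConvTranspose2d(channels, channels, kernel_size=2 * factor,
                              stride=factor, padding=factor // 2, bias=False)
with torch.no_grad():
    upsample.weight.copy_(bilinear_kernel(channels, 2 * factor))

coarse = torch.randn(1, channels, 8, 8)   # coarse network output
dense = upsample(coarse)                  # dense output: torch.Size([1, 21, 32, 32])
print(dense.shape)
```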
## How FCNs work
1. Train a neural network for image classification on input images of a fixed size ($d \times w \times h$, i.e. $d$ channels and a $w \times h$ spatial extent).
2. Reinterpret the network as one convolutional filter per output neuron (so $k$ output neurons become $k$ filters), each spanning the complete image area on which the original network was trained (see the sketch after this list).
3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$
4. If $s > 1$, then you need an upsampling layer (deconvolutional layer) to convert the coarse output into a dense output.
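A minimal PyTorch sketch of steps 2 and 3 for a toy network; the layer sizes and channel counts are assumptions for illustration, not the AlexNet/VGG/GoogLeNet architectures used in the paper:

```python
import torch
import torch.nn as nn

# Step 1 (assumed already done): a toy "classification" net trained on fixed-size 3 x 32 x 32 inputs.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2),   # 32 -> 15
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),       # 15 -> 7
)
fc = nn.Linear(16 * 7 * 7, 10)   # original classifier head over the flattened 16 x 7 x 7 features

# Step 2: reinterpret the fully connected layer as k = 10 convolutional filters,
# each spanning the whole 7 x 7 feature map.
fc_conv = nn.Conv2d(16, 10, kernel_size=7)
with torch.no_grad():
    fc_conv.weight.copy_(fc.weight.view(10, 16, 7, 7))
    fc_conv.bias.copy_(fc.bias)

fully_convolutional = nn.Sequential(features, fc_conv)

# Step 3: the same weights now run over images of any size >= 32 x 32,
# producing a spatial grid of class scores instead of a single score vector.
print(fully_convolutional(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10, 1, 1])
print(fully_convolutional(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 10, 9, 9])
```

Step 4 (upsampling the coarse 9 x 9 grid back to input resolution) is the bilinear deconvolution sketch shown under "Main ideas" above.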
## Nice properties
* FCNs take images of arbitrary size and produce a correspondingly sized, dense (per-pixel) output.
* Computationally efficient: computation is shared across overlapping receptive fields instead of evaluating the classifier patch by patch.
## See also:
https://www.quora.com/What-are-the-benefits-of-converting-a-fully-connected-layer-in-a-deep-neural-network-to-an-equivalent-convolutional-layer
> They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output.
>
> Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), a pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to a set of 10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with a 25 x 25 input window and a stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 7 x 7 x 10 spatially dense output.
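A quick sanity check of the arithmetic in the quote (channel counts below are arbitrary; only the spatial sizes matter):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2),   # 25 -> 12,  50 -> 24
    nn.MaxPool2d(kernel_size=3, stride=2),      # 12 -> 5,   24 -> 11
    nn.Conv2d(8, 10, kernel_size=5),            # converted fully connected layer: 5 -> 1,  11 -> 7
)

print(net(torch.randn(1, 1, 25, 25)).shape)   # torch.Size([1, 10, 1, 1])
print(net(torch.randn(1, 1, 50, 50)).shape)   # torch.Size([1, 10, 7, 7])
```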