V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
arXiv e-Print archive - 2016 via Local Bibsonomy
Medical image segmentation have been a classic problem in medical image analysis, with a score of research backing the problem. Many approaches worked by designing hand-crafted features, while others worked using global or local intensity cues. These approaches were sometimes extended to 3D, but most of the algorithms work with 2D images (or 2D slices of a 3D image). It is hypothesized that using the full 3D volume of a scan may improve segmentation performance due to the amount of context that the algorithm can be exposed to, but such approaches have been very expensive computationally. Deep learning approches like ConvNets have been applied to segmentation problems, which are computationally very efficient during inference time due to highly optimized linear algebra routines. Although these approaches form the state-of-art, they still utilize 2D views of a scan, and fail to work well on full 3D volumes.
To this end, Milletari et al. propose a new CNN architecture consisting of volumetric convolutions with 3D kernels, on full 3D MRI prostate scans, trained on the task of segmenting the prostate from the images. The network architecture primarily consisted of 3D convolutions which use volumetric kernels having size 5x5x5 voxels. As the data proceeds through different stages along the compression
path, its resolution is reduced. This is performed through convolution with
2x2x2 voxels wide kernels applied with stride 2, hence there are no pooling layers in the architecture. The architecutre resembles an encoder-decoder type architecture with the decoder part, also called downsampling, reduces the size of the signal presented as input and increases the receptive field of the features being computed in subsequent network layers. Each of the stages of the left part of the network, computes a number of features which is two times higher than the one of the previous layer. The right portion of the network extracts features and expands the spatial
support of the lower resolution feature maps in order to gather and assemble the necessary information to output a two channel volumetric segmentation. The two features maps computed by the very last convolutional layer, having 1x1x1 kernel size and producing outputs of the same size as the input volume, are converted to probabilistic segmentations of the foreground and background
regions by applying soft-max voxelwise.
In order to train the network, the authors propose to use Dice loss function. The CNN is trained end-to-end on a dataset of 50 prostate scans in MRI. The network approached a 0.869 $\pm$ 0.033 dice loss, and beat the other state-of-art models.