[link]
**Contributions**: Use dropout to get segmentation with a measure of model uncertainty.

**Explanation**: We can consider dropout as a way of drawing samples from a posterior distribution over models (see these papers: [1](https://arxiv.org/abs/1506.02142), [2](https://arxiv.org/abs/1506.02158)) and thus it can be used to do Bayesian inference. This amounts to using dropout both at training and test time and collecting multiple outputs at test time (i.e. sampling from the model distribution). The mean of these outputs is taken as the final segmentation and the variance as the model uncertainty. Sample averaging performs better than weight averaging (i.e. the usual test-time treatment of dropout) if averaged over more than 6 samples; the paper used 50 samples. This is a general technique which can be applied to any segmentation model. Sample averaging alone improves scores by 2-3%.

**Benchmarks** (score without dropout -> score with dropout):

*VOC2012*
- Dilation Network: 71.3 -> 73.1 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20Dilation%20Network)
- FCN8: 62.2 -> 65.4 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20FCN)
- SegNet: 59.1 -> 60.5 (Source: reported in the paper)

*CamVid*
- SegNet: 71.20 -> 76.3 (Source: reported in the paper)

**My comments**: Nothing very new. The gotcha is that sample averaging performs better.
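A minimal sketch of the test-time sampling described above, assuming a PyTorch segmentation model that outputs per-class logits (the function name, sample count, and treatment of dropout layers are illustrative, not taken from the paper):

```python
import torch

def mc_dropout_segment(model, image, n_samples=50):
    """Monte Carlo dropout inference: keep dropout active at test time,
    average the sampled class probabilities and use their variance as
    a per-pixel measure of model uncertainty."""
    model.eval()
    # Switch only the dropout layers back to train mode (batch norm stays frozen).
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        samples = torch.stack([
            torch.softmax(model(image), dim=1) for _ in range(n_samples)
        ])                                    # (n_samples, B, C, H, W)
    mean = samples.mean(dim=0)                # averaged segmentation probabilities
    uncertainty = samples.var(dim=0)          # per-class, per-pixel variance
    return mean.argmax(dim=1), uncertainty
```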
[link]
Layer-wise Relevance Propagation (LRP) is a novel technique that the authors have used in multiple use cases (apart from this publication) to demonstrate the robustness and advantages of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of deep learning models. Apart from LRP, the authors also discuss quantitative ways to measure the accuracy of the generated heatmaps.

### LRP & Alternatives

What is LRP? LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contribution of each pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by $R_i^{(l)}$ the relevance associated with the $i$-th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated with the $j$-th neuron in the next layer, the conservation principle requires that $\sum_i R_i^{(l)} = \sum_j R_j^{(l+1)}$, where $R_i^{(l)}$ is given as $R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \cdot \mathrm{sign}(\sum_{i'} z_{i'j})} R_j^{(l+1)}$, and $z_{ij}$ is the contribution to the activation of the $j$-th neuron coming from the input of the $i$-th neuron.

As per the authors, this is not necessarily the only relevance function that satisfies conservation. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of a higher-layer neuron receive a larger share of the relevance $R_j$ of that neuron $j$. A downside of this propagation rule (at least if $\epsilon = 0$) is that the denominator may tend to zero if lower-level contributions to neuron $j$ cancel each other out. The numerical instability can be overcome by setting $\epsilon > 0$; however, in that case the conservation idea is relaxed in order to gain better numerical properties. To conserve relevance exactly, the rule can alternatively be formulated over the positive and negative parts of the contributions, $R_i^{(l)} = \sum_j \left( \alpha \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}} - \beta \frac{z_{ij}^{-}}{\sum_{i'} z_{i'j}^{-}} \right) R_j^{(l+1)}$, such that $\alpha - \beta = 1$.

#### Alternatives to LRP for heatmaps

**Sensitivity measurement** In such methods of generating heatmaps, the gradient of the output with respect to the input is used to build the heatmap. This quantity measures how much small changes in a pixel's value locally affect the network output.

##### Disadvantages

Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, thereby making the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients only flow through the neuron with maximum intensity in a local neighbourhood. Also, since most of these methods use the absolute impact on the prediction caused by changes in pixel intensities, the information about whether a pixel counted in favour of or against the evidence is lost.

**Deconvolutional Networks**

##### Disadvantages

Here the backward discontinuity problem of sensitivity-based methods is absent, hence global features can be captured.
However, since the method only takes the activations from the final layer (which mostly encode the presence or absence of features), using it for generating heatmaps is likely to yield average maps, lacking image-specific localisation effects. LRP counters these effects nicely because of the way it propagates relevance.

#### Performance of heatmaps

A few concerns that the authors raise are:

- A heatmap is not a segmentation mask; on the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class 'bicycle' may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of the bike are not visible. In these images salient features fail to explain the classifier's decision (which still may be correct).

The authors propose a novel method, MoRF (*Most Relevant First*), for objectively quantifying the quality of a heatmap. A detailed description of the measure is best obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are applied (in non-relevant areas). The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC).

#### Observation

Most of the sensitivity-based methods answer the question *what change would make the image more or less belong to the category car*, which isn't really the classifier's question. LRP aims to answer the real classifier question: *what speaks for the presence of a car in the image*. The figure in the paper gives a good example of how LRP can denoise heatmaps generated on the basis of sensitivity.
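To make the propagation rule concrete, here is a small numpy sketch of the epsilon-stabilised rule above for a single fully connected layer (function name, toy shapes, and the epsilon value are illustrative):

```python
import numpy as np

def lrp_epsilon(activations, weights, relevance_upper, epsilon=1e-6):
    """Epsilon-stabilised LRP rule for one fully connected layer: each lower
    neuron i gets a share of the upper-layer relevance R_j proportional to
    its contribution z_ij = a_i * w_ij."""
    z = activations[:, None] * weights                # z_ij, shape (n_lower, n_upper)
    denom = z.sum(axis=0)                             # sum_i z_ij for each upper neuron j
    denom = denom + epsilon * np.where(denom >= 0, 1.0, -1.0)   # avoid division by ~0
    return (z / denom) @ relevance_upper              # R_i = sum_j (z_ij / denom_j) R_j

# toy example: 3 lower neurons, 2 upper neurons; conservation holds up to epsilon
R_lower = lrp_epsilon(np.array([1.0, 0.5, 2.0]),
                      np.random.randn(3, 2),
                      relevance_upper=np.array([0.7, 0.3]))
```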
[link]
Mask RCNN takes off from where Faster RCNN left off, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN. It adds a third branch for segmentation. The third branch takes the output of the RoIAlign layer and predicts a binary mask for each class.

##### Major changes and intuitions

**Mask prediction** The mask branch predicts a binary mask for each RoI using a small fully convolutional head. The stark difference is the use of a *sigmoid* activation for the final mask instead of a *softmax*, which implies the masks don't compete with each other. This *decouples* segmentation from classification: the class prediction branch handles classification, and for the loss only the mask corresponding to the RoI's ground-truth class is used to compute Lmask (a sketch of this loss follows the summary). The authors also show that a single class-agnostic mask prediction works almost as effectively as a separate mask per class, further supporting the decoupling of classification from segmentation.

**RoIAlign** RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Instead of quantizing the RoI boundaries or bins, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregates the result (using max or average).

**Backbone architecture** Faster RCNN uses a VGG-like structure for extracting features from the image, whose weights are shared between the RPN and the region detection layers. Here, the authors experiment with two backbone architectures: a plain ResNet used the way the VGG was in FRCNN, and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from previous layers and recombines them to produce a pyramid of feature maps to be used for prediction, instead of a single-scale feature layer (the final conv output before the fc layers, as used in Faster RCNN).

**Training objective** The training objective is $L = L_{cls} + L_{box} + L_{mask}$, where $L_{mask}$ is the addition over Faster RCNN; the way it is calculated was mentioned above.

#### Observation

Mask RCNN performs significantly better than the COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper.
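Here is a minimal PyTorch-style sketch of the decoupled mask loss described above, where only the mask of each RoI's ground-truth class is penalised with a per-pixel sigmoid/binary cross-entropy (tensor shapes and the function name are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """L_mask: the head outputs one m x m mask per class, but only the mask of
    each RoI's ground-truth class is penalised, with a per-pixel sigmoid /
    binary cross-entropy, so the class masks do not compete with each other."""
    # mask_logits: (num_rois, num_classes, m, m) raw logits from the mask head
    # gt_classes:  (num_rois,) long tensor of ground-truth class ids
    # gt_masks:    (num_rois, m, m) binary targets
    idx = torch.arange(mask_logits.size(0), device=mask_logits.device)
    selected = mask_logits[idx, gt_classes]          # (num_rois, m, m)
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())
```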
[link]
Feature Pyramid Networks (FPNs) build on top of the state-of-the-art implementation for object detection, Faster RCNN. Faster RCNN faces a major problem in training for scale invariance, as the computations can be memory-intensive and extremely slow, so FRCNN applies a multi-scale approach only at test time. On the other hand, feature pyramids were mainstream when hand-engineered features were used, primarily to counter scale variation. Feature pyramids are collections of features computed at multi-scale versions of the same image. Improving on a similar idea presented in *DeepMask*, FPN brings back feature pyramids by using conv-layer feature maps of differing spatial resolutions, with predictions happening at all levels of the pyramid. Using the feature maps directly as they are would be tough, since the initial layers tend to contain lower-level representations with poor semantics but good localisation, whereas the deeper layers tend to contain higher-level representations with rich semantics but poor localisation due to repeated subsampling.

##### Methodology

FPN can be used with any normal conv architecture used for classification. In such an architecture the stages have progressively decreasing spatial resolutions (say C1, C2, ..., C5). FPN takes C5 and convolves it with a 1x1 kernel to reduce the number of filters, giving P5. Next, P5 is upsampled and merged with C4 (C4 is convolved with a 1x1 kernel to match the filter count of the upsampled P5) by element-wise addition to produce P4. Similarly, P4 is upsampled and merged with C3 (in the same way) to give P3, and so on. The final set of feature maps, in this case {P2, ..., P5}, is used as the feature pyramid (see the figure in the paper; a code sketch follows this summary).

*Usage of the combination {P2, ..., P5} as compared to only P2*: P2 yields the highest-resolution, most semantic features and could well be the default choice, but the weights shared across the rest of the feature levels and the learned scale invariance make the pyramidal variant more robust against generating false ROIs.

For the next steps, whether RPN or RCNN, the regressor and classifier share weights across all *anchors* (of varying aspect ratios) at each level of the feature pyramid. This step is similar to [Single Shot Detector (SSD) Networks](http://www.shortscience.org/paper?bibtexKey=conf/eccv/LiuAESRFB16).

##### Observation

FPN was used in FRCNN in the RPN and RCNN parts separately and then in both parts combined, and produced state-of-the-art results on the MS COCO challenges, bettering the COCO '15 & '16 winner models (Faster RCNN+++ & G-RMI) on mAP. FPN can also be used for instance segmentation by adding fully convolutional layers on top of the feature pyramid, where it outperforms results from *DeepMask*, *SharpMask*, and *InstanceFCN*.
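A minimal PyTorch sketch of the top-down path described above (channel widths, the nearest-neighbour upsampling, and the 3x3 smoothing convolutions are illustrative choices, not a reproduction of the official implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down FPN path: 1x1 lateral convs reduce C2..C5 to a common width,
    each coarser level is upsampled and added element-wise to the next lateral
    map, and a 3x3 conv smooths each merged map into P2..P5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        pyramid = [laterals[-1]]                       # start from the coarsest level (P5)
        for lateral in reversed(laterals[:-1]):
            top = F.interpolate(pyramid[0], size=lateral.shape[-2:], mode="nearest")
            pyramid.insert(0, lateral + top)           # element-wise merge
        p2, p3, p4, p5 = [s(p) for s, p in zip(self.smooth, pyramid)]
        return p2, p3, p4, p5
```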
[link]
SSD aims to solve the major problem with most current state-of-the-art object detectors, namely Faster RCNN and the like. These object detection algorithms share the same methodology:

- Train 2 different nets - a Region Proposal Net (RPN) and a classifier - to detect the class of an object and its bounding box separately.
- During inference, run the test image at different scales to detect objects at multiple scales and account for scale variation.

This makes the nets extremely slow. Faster RCNN operates at **7 FPS with 73.2% mAP** while SSD achieves **59 FPS with 74.3% mAP** on the VOC 2007 dataset.

#### Methodology

SSD uses a single net to predict the object class and bounding box. However, it doesn't do that directly. It uses a mechanism for choosing ROIs and trains end-to-end to predict, for each ROI, the class and the shift from the ROI's default boundary.

##### ROI selection

Borrowing from Faster RCNN, SSD uses the concept of anchor boxes for generating ROIs from feature maps. For each location in a feature map, k default boxes with different aspect ratios are chosen around it. So for a feature map of m x n resolution, that's *mnk* ROIs for a single feature layer. SSD uses multiple feature layers (with differing resolutions) for generating such ROIs, primarily to capture the scale variation of objects. But because the earlier layers of a deep conv net tend to capture low-level features, it only uses feature maps from a certain depth onwards.

##### ROI labelling

Any ROI that matches a ground-truth box for a class, after applying the appropriate transforms, with a Jaccard overlap greater than 0.5 is labelled positive. Given that the feature maps are at different resolutions and the default boxes have different aspect ratios, doing this isn't simple. SSD uses simple scaling and aspect ratios to bring the ground-truth box to the appropriate dimensions for calculating the Jaccard overlap with the default boxes at each location at the given resolution.

##### ROI classification

SSD uses a single 3x3 convolution kernel to predict, for each ROI, the 4 offsets (centre-x offset, centre-y offset, height offset, width offset) from the ground-truth box, along with confidence scores for each class. So if there are c classes (including background), there are (c+4) outputs for each default box. Summarily, we have convolution kernels that look at ROIs (default boxes around each location in a feature map layer) and generate (c+4) scores per ROI; multiple feature map layers with different resolutions are used for generating such ROIs, and each ROI is positive or negative depending on the Jaccard overlap after the ground-truth box has been scaled appropriately, taking the resolution difference between the input image and the feature map into consideration. A sketch of the prediction head is given after this summary; the full picture is in the figure in the paper.

##### Training

For each ROI a combined loss is calculated as a combination of the localisation error and the classification error. The details are best explained in the paper.

##### Inference

For the ROI predictions, a small confidence threshold is first used to filter out irrelevant predictions; Non-Maximum Suppression (NMS) with a Jaccard overlap of 0.45 per class is then applied on the remaining candidate ROIs, and the top 200 detections per image are kept.

For further understanding of the intuitions behind the paper and the results obtained, please consider giving the full paper a read. The open-sourced code is available at this [Github repo](https://github.com/weiliu89/caffe/tree/ssd).
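As promised above, here is a minimal PyTorch sketch of the 3x3 convolutional head that produces (c+4) outputs for each of the k default boxes at every feature-map location (the class/box split and the shapes are illustrative):

```python
import torch.nn as nn

class SSDHead(nn.Module):
    """Per-feature-map prediction head: for each of the k default boxes at
    every spatial location, a 3x3 conv outputs c class scores and a separate
    3x3 conv outputs 4 box offsets."""
    def __init__(self, in_channels, num_classes, k):
        super().__init__()
        self.num_classes, self.k = num_classes, k
        self.cls = nn.Conv2d(in_channels, k * num_classes, kernel_size=3, padding=1)
        self.loc = nn.Conv2d(in_channels, k * 4, kernel_size=3, padding=1)

    def forward(self, feature_map):
        b = feature_map.size(0)
        scores = self.cls(feature_map).permute(0, 2, 3, 1).reshape(b, -1, self.num_classes)
        offsets = self.loc(feature_map).permute(0, 2, 3, 1).reshape(b, -1, 4)
        return scores, offsets   # (B, m*n*k, c) and (B, m*n*k, 4)
```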
[link]
#### Idea

Reverse Classification Accuracy (RCA) aims to answer the question of how to estimate the performance of models (semantic segmentation models in the paper) in cases where ground truth is not available.

#### Why is it important

Before deployment, performance is quantified using different metrics, for which the predicted segmentation is compared to a reference segmentation, often obtained manually by an expert. But little is known about the real performance after deployment, when a reference is unavailable. RCA aims to quantify the performance in those deployment scenarios.

#### Methodology

The RCA pipeline is simple enough:

1. Train a model M on a training dataset T containing input images and ground truth {**I**, **G**}.
2. Use M to predict the segmentation map S for a new input image I.
3. Train an RCA model that maps the image I to the prediction S. As it is trained on a single data point, the RCA model will overfit; there is no validation set for it.
4. Test the RCA model on the reference images which do have ground truth G. The best performance achieved over these is taken as an indicator of the performance (DSC - Dice Similarity Coefficient) that the original model attains on the new image, whose ground truth is not available to compute the segmentation accuracy directly.

#### Observation

For validation of the RCA method, the predicted DSC and the real DSC were compared and the correlation between the two was calculated. Three types of segmentation methods and three slightly different RCA variants were used for comparison. The predicted DSC and the real DSC were highly correlated in most cases; a snapshot of the results is in the paper.
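A compact sketch of the RCA procedure above in Python pseudocode; all function names (`rca_trainer`, `evaluate_dsc`) are placeholders, not from the paper:

```python
def reverse_classification_accuracy(rca_trainer, evaluate_dsc, new_image,
                                    predicted_seg, reference_set):
    """Train a reverse segmenter on the single (new_image, predicted_seg) pair,
    apply it to reference images that do have ground truth, and take the best
    score as a proxy for the unknown accuracy on new_image."""
    reverse_model = rca_trainer(new_image, predicted_seg)   # deliberately overfits
    scores = [evaluate_dsc(reverse_model(img), gt) for img, gt in reference_set]
    return max(scores)   # predicted DSC for the new image
```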
[link]
[Batch Normalization, Ioffe et al. 2015](https://arxiv.org/abs/1502.03167) is one of the remarkable ideas of the deep learning era that sits with the likes of Dropout and Residual Connections. Nonetheless, the last few years have exposed a few shortcomings of the idea, which two years later Ioffe has tried to solve through a concept he calls Batch Renormalization.

**Issues with Batch Normalization**

- Different parameters are used to compute the normalized output during training and inference.
- Batch Norm struggles with small minibatches.
- Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario, for a minibatch of size 32 we may randomly select 16 labels and then choose 2 examples for each of these labels; the examples interact at every layer, and the model may overfit to the specific distribution of such minibatches and suffer when used on individual examples.

The problem with using moving averages directly in training is that it pits gradient optimization against normalization and can lead to the model blowing up.

**Idea of Batch Renormalization**

We know that ${\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d}$ where ${r = \frac{\sigma_B}{\sigma}, d = \frac{\mu_B - \mu}{\sigma}}$. The batch renormalization algorithm follows from this (see the algorithm box in the paper). Ioffe writes further that for practical purposes,

> In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.

In the experiments, for Batch Renorm the author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps and $d_{max}$ = 5 at 25k steps. A training step means an update to the model.
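For illustration, a sketch of the Batch Renorm forward pass in training mode, following the r/d definitions above (PyTorch, affine scale/shift omitted; the momentum value and 2-D input shape are assumptions):

```python
import torch

def batch_renorm(x, running_mean, running_std, r_max, d_max, eps=1e-5, momentum=0.01):
    """Batch Renormalization forward pass for a (N, features) input.
    r and d correct the minibatch statistics towards the moving averages and
    are treated as constants for backpropagation (hence .detach())."""
    mu_b = x.mean(dim=0)
    sigma_b = x.std(dim=0, unbiased=False) + eps
    r = (sigma_b / running_std).clamp(1.0 / r_max, r_max).detach()
    d = ((mu_b - running_mean) / running_std).clamp(-d_max, d_max).detach()
    x_hat = (x - mu_b) / sigma_b * r + d
    # Update the moving averages outside the autograd graph.
    running_mean += momentum * (mu_b.detach() - running_mean)
    running_std += momentum * (sigma_b.detach() - running_std)
    return x_hat, running_mean, running_std
```

Note that with r_max = 1 and d_max = 0 the clamping forces r = 1 and d = 0, recovering plain batchnorm, which is exactly how the ramp-up described above starts.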
[link]
In this paper, the authors (Bengio et al.) use DenseNets for semantic segmentation. DenseNets iteratively concatenate input feature maps to output feature maps. The biggest contribution is the use of a novel upsampling path, given that conventional upsampling would have caused a severe memory crunch.

#### Background

Fully convolutional semantic segmentation nets generally follow a conventional pattern: a downsampling path which acts as a feature extractor, and an upsampling path that restores the location information of the features extracted in the downsampling path. As opposed to Residual Nets (where input feature maps are added to the output), in DenseNets the output is concatenated to the input, which has some interesting implications:

- DenseNets are efficient in parameter usage, since all the feature maps are reused.
- DenseNets perform deep supervision thanks to the short paths to all feature maps in the architecture.

Using DenseNets for segmentation, though, has an issue with upsampling in the conventional way of concatenating feature maps through skip connections, as the number of feature maps could easily go beyond 1-1.5K. So the authors suggest a novel approach wherein only the feature maps produced by the last dense block are upsampled, not the entire set of accumulated feature maps. After upsampling, the output is concatenated with the feature maps of the same resolution from the downsampling path through a skip connection. That way, the information lost during pooling in the downsampling path can be recovered.

#### Methodology & Architecture

In the downsampling path, the input is concatenated with the output of a dense block, whereas for upsampling the output of the dense block is upsampled (without concatenating it with the input) and then concatenated with the same-resolution output of the downsampling path. The overall architecture and the structure of a dense block are shown in figures in the paper.

#### Results

The 103-conv-layer DenseNet (FC-DenseNet103) performed better than shallower networks when compared on the CamVid dataset, even though the FC-DenseNets were neither pre-trained nor used any post-processing like CRFs or temporal smoothing. When compared to other nets, the FC-DenseNet architectures achieve state-of-the-art results, improving upon models with 10 times more parameters. It is also worth mentioning that the small FC-DenseNet56 model already outperforms popular architectures with at least 100 times more parameters.
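A minimal PyTorch sketch of a dense block as used in the upsampling path described above, returning only the newly produced feature maps (growth rate and number of layers are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: each layer sees the concatenation of the block input and
    all previously produced maps; the block returns only the new maps, which
    is what the FC-DenseNet upsampling path transitions up."""
    def __init__(self, in_channels, growth_rate=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1),
            )
            for i in range(n_layers)
        ])

    def forward(self, x):
        new_maps = []
        for layer in self.layers:
            out = layer(torch.cat([x] + new_maps, dim=1))
            new_maps.append(out)
        return torch.cat(new_maps, dim=1)   # only the newly produced feature maps
```

In the downsampling path the block input would additionally be concatenated with this output, as described above.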
[link]
Sub-pixel CNN proposes a novel architecture for solving the ill-posed problem of super-resolution (SR). Super-resolution involves the recovery of a high resolution (HR) image or video from its low resolution (LR) counterpart and is a topic of great interest in digital image processing.

#### Background & problems with current approaches

LR images can be understood as low-pass-filtered versions of HR images. A key assumption that underlies many SR techniques is that much of the high-frequency spatial data is redundant and thus can be accurately reconstructed from low-frequency components. There are two kinds of SR techniques. One treats multiple LR images as different instances of the HR image and uses classical CV techniques like registration and fusion to construct the HR image; apart from the constraint of requiring multiple images, these can be inaccurate as well. The other, single image super-resolution (SISR), learns the implicit redundancy present in natural data to recover missing HR information from a single LR instance. Among SISR techniques that deploy deep learning, most methods tend to upsample the LR image first (using bicubic interpolation or learnable filters) and learn the remaining filters on the upsampled image. The problem with these methods is that increasing the resolution of the LR image before the enhancement step increases the computational complexity. This is especially problematic for convolutional networks, where the processing speed directly depends on the input image resolution. Secondly, interpolation methods typically used for the task, such as bicubic interpolation, do not bring additional information to solve the ill-posed reconstruction problem. Also, in other techniques like deconvolution, the spaces between pixels are filled with zero values, and these zero values carry no gradient information that can be backpropagated through.

#### Innovation

In sub-pixel CNN the authors propose an architecture (ESPCN) that learns the filters of its convolutional layers at LR resolution. Only the last layer transforms into HR space, using a sub-pixel convolution layer. This layer, in order to tide over the problems of deconvolution, uses a Periodic Shuffling (PS) operator for upsampling. The idea behind periodic shuffling is that the activations inserted between pixels shouldn't simply be zeros; instead, the last layer's feature maps are rearranged into the HR output (the exact equation and the architecture, with the colouring in the last layer illustrating PS, are shown in the paper).

#### Results

ESPCN was trained and tested on ImageNet along with other databases. ESPCN performs significantly better than the Super-Resolution Convolutional Neural Network (SRCNN) & TRNN, which were the best performing published approaches as of the publication of ESPCN.
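Here is a small PyTorch sketch of the periodic shuffling (PS) operator described above; it is equivalent to `torch.nn.PixelShuffle`, and the toy shapes are illustrative:

```python
import torch

def periodic_shuffle(x, r):
    """Rearrange a tensor with C*r^2 channels at LR resolution (H x W) into
    C channels at HR resolution (rH x rW)."""
    b, c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(b, c, r, r, h, w)      # split channels into an r x r sub-pixel grid
    x = x.permute(0, 1, 4, 2, 5, 3)      # interleave: (b, c, h, r, w, r)
    return x.reshape(b, c, h * r, w * r)

# e.g. the last layer outputs C*r^2 feature maps which PS turns into the HR image
hr = periodic_shuffle(torch.randn(1, 3 * 4, 8, 8), r=2)   # -> (1, 3, 16, 16)
```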
[link]
ENet, a fully convolutional net, is aimed at making real-time inference for segmentation. ENet is up to 18× faster, requires 75× fewer FLOPs, has 79× fewer parameters, and provides similar or better accuracy than existing models. SegNet has been used for benchmarking purposes.

##### Innovation

ENet achieves faster speeds because, instead of following the regular encoder-decoder structure of most state-of-the-art semantic segmentation nets, it uses a ResNet-style approach. While most nets learn features from a heavily downsampled convolutional map and then upsample the learnt features, ENet learns features, using the underlying principles of ResNet, from a convolutional map that is only downsampled three times in the encoder part. The encoder part is designed to have the same functionality as convolutional architectures used for classification tasks. The author also notes that aggressive downsampling tends to hurt accuracy, apart from hurting memory through the corresponding upsampling layers. Hence, by including only 2 upsampling operations and a single deconvolution, ENet aims to achieve faster semantic segmentation.

##### Architecture

The architecture consists of an initial block followed by a sequence of bottleneck blocks (the full layer table, the initial block, and the bottleneck structure are shown in figures in the paper). In order to reduce FLOPs further, no bias terms were included in any of the projection steps, as cuDNN uses separate kernels for convolution and bias addition.

##### Training

The encoder and decoder parts are trained separately. The encoder part is trained to categorize downsampled regions of the input image; then the decoder is appended and the network is trained to perform upsampling and pixel-wise classification.

##### Results

The results are reported on the widely used NVIDIA Titan X GPU as well as on the NVIDIA TX1 embedded system module. For segmentation results on the CamVid, Cityscapes & SUN RGB-D datasets, the inference times and model sizes were significantly lower, while the IoU or accuracy (as applicable) was nearly equivalent to SegNet in most cases.
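For reference, a sketch of the initial block as described in the ENet paper (a 13-filter 3x3 stride-2 convolution concatenated with a 2x2 max-pooling of the 3-channel input); the PyTorch code and the PReLU/BatchNorm choices here follow my reading of the paper rather than the summary above:

```python
import torch
import torch.nn as nn

class ENetInitialBlock(nn.Module):
    """ENet initial block: a cheap stride-2 convolution with 13 filters is
    concatenated with a max-pooled copy of the input, giving 16 feature maps
    at half resolution. Bias is omitted, matching the FLOP-saving note above."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 13, kernel_size=3, stride=2, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bn = nn.BatchNorm2d(16)
        self.act = nn.PReLU(16)

    def forward(self, x):
        out = torch.cat([self.conv(x), self.pool(x)], dim=1)   # 13 + 3 = 16 channels
        return self.act(self.bn(out))
```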
[link]
Xception Net, or Extreme Inception Net, brings a new perspective on Inception Nets. Inception Nets, as first published (as GoogLeNet), consisted of Network-in-Network modules (see the module diagram in the paper). The idea behind Inception modules was to look at cross-channel correlations (via 1x1 convolutions) and spatial correlations (via 3x3 convolutions), the main concept being that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly. This idea is the genesis of Xception Net, which uses depthwise separable convolutions (convolutions which look at spatial correlations in each channel independently and then use pointwise convolutions to project to the requisite channel space, leveraging inter-channel correlations). Chollet does a wonderful job of explaining how regular convolution (looking at both channel & spatial correlations simultaneously) and depthwise separable convolution (looking at channel & spatial correlations independently in successive steps) are endpoints of a spectrum, with the original Inception modules lying in between.

*Note that for Xception Net, Chollet uses depthwise separable layers which perform 3x3 convolutions for each channel first and then 1x1 convolutions on their output (the opposite order of operations to the extreme-Inception depiction).*

##### Input

The input is images used for classification along with the corresponding labels.

##### Architecture

The architecture of Xception Net resembles VGG-16, with the convolution-maxpool blocks replaced by residual blocks of depthwise separable convolution layers (the full architecture diagram is in the paper).

##### Results

Xception Net was trained using hyperparameters tuned for the best performance of Inception V3. On both an internal dataset and ImageNet, Xception outperformed Inception V3. Points to be noted:

- Both Xception & Inception V3 have roughly the same number of parameters (~24M), hence any improvement in performance can't be attributed to network size.
- Xception currently trains slightly slower than Inception V3, which is expected to improve in the future as depthwise convolution kernels get better optimized.
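A minimal PyTorch sketch of the depthwise separable convolution described above, in the order Xception uses (3x3 per-channel convolution first, then a 1x1 pointwise convolution); channel arguments are illustrative:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a 3x3 convolution applied to each
    channel independently (groups=in_channels) captures spatial correlations,
    then a 1x1 pointwise convolution mixes information across channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```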
[link]
3D image classification is one of the most important requirements in healthcare. But due to the huge size of such images and the small amount of data, it is difficult to train traditional end-to-end classification models such as AlexNet, ResNet, or VGG. Most approaches to 3D images use slices of the image rather than the image as a whole, which at times leads to loss of information along the third axis. The paper uses the complete 3D image for training, and since the number of images is small, unsupervised learning is used for weight initialization and the fully connected layers are fine-tuned with supervised learning for classification. The major contribution of the paper is the extension of 2D convolutional auto-encoders to the 3D domain and taking it further for MRI classification, which can be useful in numerous other applications.

**Algorithm**

1. Train 3D convolutional auto-encoders using the [CADDementia dataset](https://grand-challenge.org/site/caddementia/home/).
2. Extract the encoder layers from the trained auto-encoders and add fully connected layers to them, followed by a softmax.
3. Train and test the classification model using [ADNI data](http://adni.loni.usc.edu/).

**3D CAN Model**

This model is a 3D extension of [2D Convolutional Auto-encoders](http://people.idsia.ch/~ciresan/data/icann2011.pdf) (2D CAN). Each encoder block is a 3D convolution followed by ReLU and max-pooling, while each decoder block is a max-unpooling followed by a full 3D convolution and ReLU. In the last decoder block, a sigmoid is used instead of the ReLU. The weights are shared between the decoder and encoder as specified in [2D CAN](http://people.idsia.ch/~ciresan/data/icann2011.pdf).

**3D Classification model**

The weights of the classification model are initialized from the trained 3D-CAN model, as mentioned in the algorithm.

**Loss Function**

Weighted negative log-likelihood for classification and mean squared error for the unsupervised learning.

**Training algorithm**

[Adadelta](https://arxiv.org/abs/1212.5701)

**Results**

89.1% accuracy on 10-fold cross-validation on a dataset of 270 patients for classification of MRI into Mild Cognitive Impairment (MCI), Normal Control (NC), and Alzheimer's disease (AD).

**Image Source**

All images are taken from the [paper](https://arxiv.org/abs/1607.00455) itself.
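A sketch of one encoder stage as described above (3D convolution, ReLU, 3D max-pooling), in PyTorch; channel sizes are illustrative, and the decoder would mirror it with unpooling and transposed 3D convolutions:

```python
import torch.nn as nn

class Encoder3DBlock(nn.Module):
    """One 3D convolutional auto-encoder encoder stage: Conv3d -> ReLU ->
    MaxPool3d. Pooling indices are returned so a mirrored decoder can
    max-unpool at the matching locations."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2, return_indices=True)

    def forward(self, x):
        x = self.act(self.conv(x))
        x, pool_indices = self.pool(x)      # indices are needed for max-unpooling
        return x, pool_indices
```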
[link]
Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series. This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.

The current paper improves step 1, i.e., region/object proposals. Most object proposal approaches fall into three categories:

* Objectness scoring
* Seed segmentation
* Superpixel merging

The current method is different from these three. It shares similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are produced by a single convolutional network with multiple outputs. All the convolutional layers except the last few are from a pretrained VGG-A model. Each training sample is a triplet of an RGB input patch, the binary mask corresponding to the input patch, and a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:

* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for a single object at the center even when multiple objects are present. Figure 1 of the paper shows the architecture and the sampling used for training. The model is then jointly trained for segmentation and objectness; negative samples are not used for the segmentation loss.

## Inference

During full-image inference, the model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional, like in a fully convolutional network (FCN). This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.
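A rough PyTorch sketch of the two output branches described above, attached to a shared convolutional trunk; the exact layer sizes differ from the paper and are only illustrative:

```python
import torch.nn as nn

class MaskScoreHeads(nn.Module):
    """Two heads on a shared feature trunk: one predicts a binary mask for the
    centered object, the other a single objectness score for the patch."""
    def __init__(self, feat_channels=512, feat_size=14, mask_size=56):
        super().__init__()
        self.mask_size = mask_size
        self.mask_branch = nn.Sequential(
            nn.Conv2d(feat_channels, 128, kernel_size=1), nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * feat_size * feat_size, mask_size * mask_size),
        )
        self.score_branch = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_channels * feat_size * feat_size, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 1),
        )

    def forward(self, features):
        mask = self.mask_branch(features).view(-1, self.mask_size, self.mask_size)
        score = self.score_branch(features)
        return mask, score
```

Joint training then combines a pixel-wise loss on the mask (only for positive patches) with a loss on the objectness score.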
[link]
Fast RCNN is a proposal-based detection net for object detection tasks.

##### Input & Output

The input to Fast RCNN is the input image and region proposals (generated using Selective Search). There are 2 outputs of the net: class probabilities over all object classes plus background (e.g. 21 classes for Pascal VOC '12), and the corresponding bounding box parameters for each object class.

##### Architecture

The Fast RCNN version of any deep net needs 3 major modifications, e.g. for VGG-16:

1. A RoI pooling layer is added after the final maxpool output, before the fully connected layers.
2. The final FC layer is replaced by 2 sibling branches - one giving a softmax output over the class probabilities, the other predicting an encoding of the 4 bounding box parameters (x, y, width, height) w.r.t. the region proposals.
3. The input is modified to take 2 inputs: images and the corresponding region proposals.

**RoI pooling layer** - the most notable contribution of the paper - max-pools the features inside a proposed region into a fixed-size grid (7 x 7 for the VGG-16 version of Fast RCNN). The intuition behind the layer is to make the network faster compared to SPPnet (which used spatial pyramid pooling) and RCNN.

##### Results

The net is trained with a dual loss (log loss on the class probabilities + a smooth L1 loss on the bounding box parameters), as sketched after this summary. The results were very impressive on the VOC '07, '10 & '12 datasets, with Fast RCNN outperforming the other nets in terms of mAP.
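As referenced in the results above, here is a sketch of the two-headed training objective (log loss plus smooth L1 box regression applied only to foreground RoIs); tensor shapes and the loss weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_deltas, gt_classes, gt_deltas, lam=1.0):
    """Multi-task loss: cross-entropy over the class scores plus a smooth L1
    regression loss on the box encodings of each foreground RoI, taken from
    the columns corresponding to its ground-truth class."""
    # cls_scores:  (num_rois, num_classes)    raw logits from the softmax branch
    # bbox_deltas: (num_rois, num_classes*4)  per-class box encodings
    # gt_classes:  (num_rois,) with 0 = background
    # gt_deltas:   (num_rois, 4) regression targets w.r.t. the proposals
    cls_loss = F.cross_entropy(cls_scores, gt_classes)
    fg = gt_classes > 0
    if fg.any():
        idx = torch.arange(bbox_deltas.size(0), device=bbox_deltas.device)[fg]
        picked = bbox_deltas.view(-1, cls_scores.size(1), 4)[idx, gt_classes[fg]]
        loc_loss = F.smooth_l1_loss(picked, gt_deltas[fg])
    else:
        loc_loss = bbox_deltas.sum() * 0.0   # keep the graph connected
    return cls_loss + lam * loc_loss
```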