Welcome to ShortScience.org! 
[link]
This method is based on improving the speed of RCNN \cite{conf/cvpr/GirshickDDM14} 1. Where RCNN would have two different objective functions, Fast RCNN combines localization and classification losses into a "multitask loss" in order to speed up training. 2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being set an an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of subwindows of approximate size $h/H \times w/W$ and then maxpooling the values in each subwindow into the corresponding output grid cell." 3. Backprop is performed for the RoI pooling layer by taking the argmax of the incoming gradients that overlap the incoming values. This method is further improved by the paper "Faster RCNN" \cite{conf/nips/RenHGS15} 
[link]
Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixeltopixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool. #### Methodology Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class. ##### Major Changes and intutions **Mask prediction** Mask prediction segmentation predicts a binary mask for each RoI using fully convolution  and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask. Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation **RoIAlign** RoIPool first quantizes a floatingnumber RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of quantization of the RoI boundaries or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). **Backbone architecture** Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures  ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of singlescale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) **Training Objective** The training objective looks like this ![](https://i.imgur.com/snUq73Q.png) Lmask is the addition from Faster RCNN. The method to calculate was mentioned above #### Observation Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper 
[link]
This paper presents a method to extract motion (dynamic) and skeleton / cameraview (static) representations from the video of a person represented as a 2D joints skeleton. This decomposition allows transferring the motion to different skeletons (retargeting) and many more. It does so by utilizing deep neural networks. https://i.imgur.com/J5jBzcs.png The architecture consists of motion and skeleton / cameraview encoders that decompose an input sequence of 2D joint positions into latent spaces and a decoder that reconstructs a sequence from such components. The motion vector varies in length, while skeleton and camera view representations are fixed. https://i.imgur.com/QaDksg1.png This is achieved by the nature of the network design. Specifically, motion encoder uses 1D convolutions with strides, thus output dimensions are proportionally related to the input. On the other hand, the static encoder uses global average pooling in the final layer to produce a fixedsize latent representation: https://i.imgur.com/Cf7TVKA.png More detailed design of the encoders and decoder is shown below: https://i.imgur.com/cpaveFm.png **Dataset**. Adobe Mixamo is used to obtain sequences of poses of different 3D characters. It allows creating multiple samples where different characters (with different skeleton structure) perform the same motions. These 3D video clips are then projected into 2D by selecting arbitrary view angles and distance to the object. Thus, we can easily create multiple pairs of 2D image sequences of characters (same or different) performing various actions (same or different) from various views. **Loss functions** used to for training (refer the paper for the detailed formulas):  *Cross Reconstruction Loss* It is a sum of two other losses. The first one is the reconstruction loss where the network tries to reconstruct original input. The second one is cross reconstruction loss where the network tries to reconstruct the sequence where a different character performs the exact same action as the input. It is best shown in the Figure below: https://i.imgur.com/ewZOAox.png  *Triplet Loss* This loss aims to bring latent spaces of similar motions closer together, while separate apart the ones that are different. It takes two triplets, where each contains two samples that share the same (or very similar) motion and one with different. The same concept is applied to the static latent space.  Foot velocity loss This loss helps to remove the foot skating phenomenon  hands and feet exhibit larger errors that the other keypoints. https://i.imgur.com/DclJEde.png where $V_{global}$ and $V_{joint_n}$ extract the global and local ($n$th joint) velocities from the reconstructed output $\hat{p}_{ij}$, respectively, and map them back to the image units, and $V_{orig_n}$ returns the original global velocity of the $n$th joint from the ground truth, $p_{ij}$ **Normalization**  subtract the root position from all joint locations in every frame  subtract the mean joint position and divide by the standard deviation (averaged over the entire dataset)  perframe global velocity is not touched **Data Augmentation** applied during training:  temporal clipping during the batch creation process  scaling  same as to use different camera distance to the object  flipping symmetrical joints  dropping joints to simulate behavior of a real keypoint detector as they often miss some joints  adding real video data to the training and use reprojection loss in case no labels are given **Results and Evaluation** (to be continued) ... While the summary becomes too long to be a called a summary it is worth mentioning that there are several applications possible with this approach:  performance cloning  make any 2D skeleton repeat particular motions  motion retrieval  search videos that contain the particular target motion 
[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/) ### Feature matching problem: overtraining on the current discriminator solution: ￼$E_{x \sim p_{\text{data}}}f(x)  E_{z \sim p_{z}(z)}f(G(z))_{2}^{2}$ were f(x) activations intermediate layer in discriminator ### Minibatch discrimination problem: generator to collapse to a single point solution: for each sample i, concatenate to $f(x_i)$ features $b$ measuring its distance to other samples j (i and j are both real or generated samples in same batch): $\sum_j \exp(M_{i, b}  M_{j, b}_{L_1})$ ￼ this generates visually appealing samples very quickly ### Historical averaging problem: SGD fails by going into extended orbits solution: parameters revert to the mean $ \theta  \frac{1}{t} \sum_{i=1}^t \theta[i] ^2$ ￼ ### Onesided label smoothing problem: discriminator vulnerability to adversarial examples solution: discriminator target for positive samples is 0.9 instead of 1 ### Virtual batch normalization problem: using BN cause output of examples in batch to be dependent solution: use reference batch chosen once at start of training and each sample is normalized using itself and the reference. It's expensive so used only on generation ### Assessment of image quality problem: MTurk not reliable solution: use inception model p(yx) to compute ￼$\exp(\mathbb{E}_x \text{KL}(p(y  x)  p(y)))$ on 50K generated images x ### Semisupervised learning use the discriminator to also classify on K labels when known and use all real samples (labels and unlabeled) in the discrimination task ￼$D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
3 Comments

[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{4}$), momentum of 0.9. They use minibatches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) 