[link]
Ilyas et al. present a follow-up work to their paper on the trade-off between accuracy and robustness. Specifically, given a feature $f(x)$ computed from input $x$, the feature is considered predictive if $\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$; similarly, a predictive feature is robust if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} yf(x + \delta)\right] \geq \gamma$. That is, a feature is considered robust if its worst-case correlation with the label exceeds some threshold $\gamma$; here, the worst case is taken over a predefined set of allowed perturbations $\Delta(x)$ relative to the input $x$. There also exist predictive features that are not robust according to this definition. In the paper, Ilyas et al. present two simple algorithms for obtaining adapted datasets which contain only robust or only non-robust features. The main idea of these algorithms is that an adversarially trained model only utilizes robust features, while a standard model utilizes both robust and non-robust features. Based on these datasets, they show that non-robust, predictive features are sufficient to obtain high accuracy; similarly, training a normal model on a robust dataset leads to reasonable accuracy and also increases robustness. Experiments were done on CIFAR-10. These observations are supported by a theoretical analysis of a toy dataset consisting of two overlapping Gaussians; I refer to the paper for details. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
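As a rough illustration of the two definitions in the summary above, the sketch below evaluates both expectations empirically for a linear feature $f(x) = w^\top x$ with labels in $\{-1, +1\}$ and an $L_\infty$ ball of radius `eps` as $\Delta(x)$. The linear feature, the choice of perturbation set, the toy data and all function names are assumptions made for illustration, not taken from the paper. In this special case the infimum has the closed form $y\, w^\top x - \epsilon \|w\|_1$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predictive_usefulness(w, data):
    """Empirical E[y * f(x)] for the linear feature f(x) = <w, x>."""
    return sum(y * dot(w, x) for x, y in data) / len(data)


def robust_usefulness(w, data, eps):
    """Empirical E[inf_{||delta||_inf <= eps} y * f(x + delta)].

    Assumes y in {-1, +1}; the worst-case delta is -eps * y * sign(w),
    which lowers y * f(x) by exactly eps * ||w||_1.
    """
    l1_norm = sum(abs(wi) for wi in w)
    return sum(y * dot(w, x) - eps * l1_norm for x, y in data) / len(data)


# Hypothetical toy data: the first coordinate is a large, label-aligned signal,
# the second a tiny but perfectly label-aligned one.
data = [((1.0, 0.1), 1), ((-1.0, -0.1), -1), ((0.8, 0.1), 1), ((-1.2, -0.1), -1)]

w_robust = (1.0, 0.0)   # reads the large signal
w_brittle = (0.0, 5.0)  # amplifies the tiny signal

for name, w in (("robust", w_robust), ("brittle", w_brittle)):
    print(name, predictive_usefulness(w, data), robust_usefulness(w, data, eps=0.25))
```

With a threshold such as $\rho = 0.3$ both features count as predictive, but only the first keeps a positive worst-case correlation under the perturbation budget, which is the distinction the summary draws between robust and non-robust features.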
[link]
# Object detection system overview.
https://i.imgur.com/vd2YUy3.png

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN's mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## There are 2 challenges faced in object detection
1. localization problem
2. labeling the data

1 localization problem:
* One approach frames localization as a regression problem. Prior work using this approach reports a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. However, units high up in the network, which has five convolutional layers, have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm difficult.

2 labeling the data:
* The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning.
* R-CNN instead uses supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation was shown to be effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN
This system consists of three modules
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.
### Module design
1 Region proposals
* related work detects mitotic cells by applying a CNN to regularly-spaced square crops.
* use the selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute).
* computing region proposals and features takes 13s/image on a GPU or 53s/image on a CPU.

2 Feature extraction.
* extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* warp all pixels in a tight bounding box around each region proposal to the required size
* The feature matrix is typically 2000x4096

3 Test time detection
* At test time, run selective search on the test image to extract around 2000 region proposals (selective search's "fast mode" is used in all experiments).
* warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

## Training
1 Supervised pre-training:
* pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data)

2 Domain-specific fine-tuning.
* continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals, with a learning rate of 0.001.

3 Object category classifiers.
* use an intersection-over-union (IoU) overlap threshold of 0.3 to label regions.
* Once features are extracted and training labels are applied, optimize one linear SVM per class.
* adopt the standard hard negative mining method, since the training data is too large to fit in memory.

### Results on PASCAL VOC 2010-12
1 VOC 2010
* compared against four strong baselines including SegDPM, DPM, UVA, Regionlets.
* Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster

https://i.imgur.com/0dGX9b7.png

2 ILSVRC2013 detection.
* ran R-CNN on the 200-class ILSVRC2013 detection dataset
* R-CNN achieves a mAP of 31.4%

https://i.imgur.com/GFbULx3.png

#### Performance layer-by-layer, without fine-tuning
1 pool5 layer
* which is the max pooled output of the network's fifth and final convolutional layer.
* The pool5 feature map is 6 x 6 x 256 = 9216 dimensional
* each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input

2 Layer fc6
* fully connected to pool5
* it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases

3 Layer fc7
* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification

#### Performance layer-by-layer, with fine-tuning
* CNN's parameters fine-tuned on PASCAL.
* fine-tuning increases mAP by 8.0 percentage points to 54.2%

### Network architectures
* 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. This network is referred to as "O-Net" for OxfordNet and the baseline as "T-Net" for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%
* the drawback is compute time: the forward pass of O-Net takes considerably longer than that of T-Net.

1 The ILSVRC2013 detection dataset
* dataset is split into three sets: train (395,918), val (20,121), and test (40,152)

#### CNN features for segmentation.
* full R-CNN: The first strategy (full) ignores the region's shape and computes CNN features directly on the warped window. As a consequence, two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: The second strategy (fg) computes CNN features only on a region's foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features

https://i.imgur.com/n1bhmKo.png
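
The greedy non-maximum suppression step from the test-time detection notes above can be sketched as follows. This is a minimal illustration, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples for a single class, the function names are made up here, and the threshold of 0.5 in the example is a placeholder, since the summary only says the threshold is learned.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first.

    A box is rejected if its IoU with an already selected,
    higher-scoring box is larger than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical scored regions for one class.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

In the full pipeline this would be run once per class on the SVM scores of the roughly 2000 proposals of an image.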

[link]
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes resampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformations, whose parameters are trained end-to-end with the rest of the model. The resulting resampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over the previous state of the art on a number of tasks. The layer consists of three parts: a mini neural network that regresses the parameters of a parametric transformation (e.g. affine), a module that applies the transformation to a regular grid, and a third module that more or less "reads off" the values at the transformed positions and maps them to a regular grid, hence undeforming the image or previous layer. Gradients for backpropagation are derived for a few cases. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
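The second and third parts described above (applying the transformation to a regular grid, then reading off values by bilinear interpolation) can be sketched in plain Python. This is only an illustration under several assumptions: a single-channel image stored as nested lists, coordinates normalized to $[-1, 1]$, an affine matrix `theta` given directly instead of being regressed by a network, and out-of-range pixels contributing zero. The function names are made up for this sketch.

```python
import math


def affine_grid(theta, height, width):
    """Map each output pixel's normalized (x, y) through the 2x3 matrix theta.

    Returns a height x width grid of source coordinates in [-1, 1].
    Requires height >= 2 and width >= 2.
    """
    grid = []
    for i in range(height):
        y_t = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x_t = -1.0 + 2.0 * j / (width - 1)
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Read off image values at the grid positions with bilinear weights."""
    h, w = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalized coordinates back to pixel coordinates.
            px = (x_s + 1.0) * (w - 1) / 2.0
            py = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:
                        weight = max(0.0, 1 - abs(px - xx)) * max(0.0, 1 - abs(py - yy))
                        value += weight * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out


image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # attends to the central half

same = bilinear_sample(image, affine_grid(identity, 3, 3))
zoomed = bilinear_sample(image, affine_grid(zoom_in, 3, 3))
```

Because the bilinear weights are piecewise linear in the sampling coordinates, the output is (sub-)differentiable with respect to `theta`, which is what allows the transformation parameters to be trained together with the rest of the model.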
[link]
# Deep Convolutional Generative Adversarial Nets

## Introduction
* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN), a topologically constrained variant of GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits
* Stable to train
* Very useful to learn unsupervised image representations.

## Model
* GANs are difficult to scale using CNNs.
* Paper proposes the following changes to GANs:
  * Replace any pooling layers with strided convolutions (for the discriminator) and fractional-strided convolutions (for the generator).
  * Remove fully connected hidden layers.
  * Use batch normalisation in both generator (all layers except output layer) and discriminator (all layers except input layer).
  * Use LeakyReLU in all layers of the discriminator.
  * Use ReLU activation in all layers of the generator (except output layer which uses Tanh).

## Datasets
* Large-Scale Scene Understanding (LSUN).
* Imagenet-1K.
* Faces dataset.

## Hyperparameters
* Minibatch SGD with minibatch size of 128.
* Weights initialized with a zero-centered Normal distribution with standard deviation = 0.02
* Adam Optimizer
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5

## Observations
* Large-Scale Scene Understanding data
  * Demonstrates that the model scales with more data and higher resolution generation.
  * It is unlikely that the model memorized images, given the low learning rate and minibatch SGD.
* Classifying CIFAR-10 dataset
  * Features
    * Train on Imagenet-1K and test on CIFAR-10.
    * Max pool the discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
    * Flatten and concatenate to get a 28672-dimensional vector.
  * Linear L2-SVM classifier trained over the feature vector.
  * 82.8% accuracy, outperforms K-means (80.6%)
* Street View House Number Classifier
  * Similar pipeline as CIFAR-10
  * 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Interpolating between points in the latent space does not show sharp transitions, indicating that the network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* Network seems to have some success in disentangling image representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.
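
The latent vector arithmetic mentioned in the last observation can be sketched in a few lines. This is a toy illustration only: the Z vectors below are made-up 2-dimensional values, averaging several samples per concept is an assumption of this sketch, and the generator that would turn the result into an image is not shown.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length Z vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def latent_arithmetic(smiling_woman, normal_woman, normal_man):
    """Compute mean(smiling woman) - mean(normal woman) + mean(normal man).

    Each argument is a list of Z vectors whose generated samples show
    the named concept; the result would be fed to the generator.
    """
    a = mean_vector(smiling_woman)
    b = mean_vector(normal_woman)
    c = mean_vector(normal_man)
    return [x - y + z for x, y, z in zip(a, b, c)]


# Hypothetical 2-dimensional Z vectors (real Z vectors are much larger).
z_smiling_woman = [[1.0, 1.0], [1.0, 3.0]]
z_normal_woman = [[0.0, 2.0]]
z_normal_man = [[0.5, 0.5]]

z_smiling_man = latent_arithmetic(z_smiling_woman, z_normal_woman, z_normal_man)
```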
[link]
This paper describes how to find local interpretable model-agnostic explanations (LIME) of why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to them. The way of sampling and the kind of interpretable model depend on the problem domain.

For computer vision / image classification, the image $x$ is divided into superpixels. Single superpixels are made black, and the new image $x'$ is evaluated as $p' = m_B(x')$. This is done multiple times.

The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7JtFMLo4) by Marco Tulio Ribeiro. A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma).

## Follow-up Papers
* June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386)
* November 2016:
  * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817)
  * [An unexpected unity among methods for interpreting model predictions](https://arxiv.org/abs/1611.07478)
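
The key idea of the LIME summary above (perturb superpixels, query the black box, fit an interpretable model locally) can be sketched as a weighted linear fit. This is a simplified illustration, not the authors' implementation: `black_box` is a hypothetical stand-in for $m_B$ applied to the image with the masked superpixels blacked out, all on/off masks are enumerated instead of sampled, and the distance measure and `kernel_width` are assumptions.

```python
import itertools
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain(black_box, num_superpixels, kernel_width=0.75):
    """Fit a locally weighted linear surrogate m_I over superpixel masks.

    A mask bit of 1 keeps a superpixel, 0 blacks it out. Returns one
    weight per superpixel (the intercept is dropped).
    """
    rows, targets, weights = [], [], []
    for mask in itertools.product([0, 1], repeat=num_superpixels):
        rows.append([1.0] + [float(bit) for bit in mask])  # intercept + mask bits
        targets.append(black_box(mask))
        # Masks closer to the original image (fewer blacked-out parts) count more.
        distance = (num_superpixels - sum(mask)) / num_superpixels
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    dim = num_superpixels + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(dim)]
            for i in range(dim)]
    xtwy = [sum(w * r[i] * t for r, w, t in zip(rows, weights, targets))
            for i in range(dim)]
    return solve_linear_system(xtwx, xtwy)[1:]


def toy_black_box(mask):
    """Hypothetical classifier score that depends mostly on superpixel 0."""
    return 0.1 + 0.6 * mask[0] + 0.0 * mask[1] + 0.2 * mask[2]


importance = explain(toy_black_box, num_superpixels=3)
```

The superpixels with the largest weights form the explanation; for real images the number of masks grows exponentially, so random sampling of masks would replace the full enumeration used here.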