[link]
TLDR; The authors propose two different architectures to improve the performance of character-level RNNs. In the first architecture ("mixed") the authors condition the model on the state of a word-level RNN. In the second architecture ("cond") they condition the output classifier on character n-grams. The authors show that the proposed architectures outperform plain character-level RNNs in terms of entropy in bits per character.

#### Key Points

- Plain character-level RNNs need a huge hidden representation in order to model long-term dependencies, while word-level RNNs can't generalize to new vocabulary and may require a huge output vocabulary.
- Model 1: Jointly train a word-level and a character-level RNN and interpolate the losses of the two models (see the sketch after this summary).
- Model 2: Condition the softmax on the n-gram preceding the current character, "relieving" the network of memorizing some of the sequence.
- Training: Constant learning rate, reduced after each epoch in which validation accuracy decreases.
- The n-gram model can be applied to arbitrary data, not just characters. The authors evaluate on binary data.

#### Notes / Questions

- In the comparison table the authors don't show the number of parameters for the models. They compare models with the same number of hidden units, but their proposed architectures need extra parameters and computation. Unfair comparison?
- People typically use LSTMs/GRUs for language modeling. Of course the proposed techniques can be applied to LSTM/GRU networks, but the experimental results may look very different. Do these architectures provide any benefit when using LSTMs/GRUs on character data?
- Entropy in bits per character seems like a somewhat strange evaluation metric. I don't really know what to make of it, and no intuitive explanations are given.
- One argument the authors make is that character-level models can be applied to arbitrary input data (different languages, binary data, code, etc.). But their mixed model is clearly very language-specific: it can't be applied to arbitrary data, and many languages don't have clear word boundaries. Similarly, n-grams may be prohibitively expensive depending on what kind of data we're working with.
- The n-gram-conditioned model isn't clearly explained. I *think* I understand what it does, but I'm not quite sure. No intuitive explanations of what any of the models are learning are given.
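Since the summary is terse about how the two losses interact, here is a minimal sketch of the "mixed" objective as I read it: a character-level and a word-level RNN trained jointly with an interpolated loss. All names and the interpolation weight `alpha` are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MixedCharWordLM(nn.Module):
    # Character-level and word-level RNNs trained jointly; their losses
    # are interpolated as described in the "Model 1" bullet above.
    def __init__(self, n_chars, n_words, hidden=256):
        super().__init__()
        self.char_rnn = nn.RNN(n_chars, hidden, batch_first=True)
        self.word_rnn = nn.RNN(n_words, hidden, batch_first=True)
        self.char_out = nn.Linear(hidden, n_chars)
        self.word_out = nn.Linear(hidden, n_words)

    def forward(self, char_onehot, word_onehot):
        char_h, _ = self.char_rnn(char_onehot)  # (batch, T_char, hidden)
        word_h, _ = self.word_rnn(word_onehot)  # (batch, T_word, hidden)
        return self.char_out(char_h), self.word_out(word_h)

def mixed_loss(char_logits, char_targets, word_logits, word_targets, alpha=0.5):
    # Interpolate the character-level and word-level cross-entropy losses;
    # alpha is a hypothetical mixing weight, not a value from the paper.
    ce = nn.CrossEntropyLoss()
    char_loss = ce(char_logits.flatten(0, 1), char_targets.flatten())
    word_loss = ce(word_logits.flatten(0, 1), word_targets.flatten())
    return alpha * char_loss + (1 - alpha) * word_loss
```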
[link]
Ranjan et al. propose to constrain deep features to lie on hyperspheres in order to improve robustness against adversarial examples. For the last fully-connected layer, this is achieved by the L2-softmax, which forces the features to lie on the hypersphere. For intermediate convolutional or fully-connected layers, the same effect is achieved analogously, i.e., by normalizing the inputs, scaling them, and applying the convolution/weight multiplication. In experiments, the authors argue that this improves robustness against simple attacks such as FGSM and DeepFool. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
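A minimal sketch of the L2-softmax idea for the last layer, assuming a plain linear classifier: the scaling factor `alpha` (the hypersphere radius) and all variable names are my own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def l2_softmax_logits(features, weight, alpha=16.0):
    # Project each feature vector onto a hypersphere of radius alpha:
    # unit L2 normalization followed by scaling. alpha is a hyperparameter;
    # the value here is an arbitrary assumption.
    on_sphere = alpha * F.normalize(features, p=2, dim=1)
    # Standard linear classifier applied to the constrained features.
    return on_sphere @ weight.t()
```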
[link]
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization** $$W \sim U \left [ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$ where $n_j \in \mathbb{N}^+$ is the number of neurons in layer $j$. Showing some ways **how to debug neural networks** might be another reason to read the paper. The paper analyzed standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images containing one or two of three shapes: triangle, parallelogram, and ellipse. The MLPs varied in the activation function used (either sigmoid, tanh, or softsign). However, no regularization was used and training ran for many mini-batch epochs. It might be that batch normalization / dropout changes the influence of initialization very much. Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it has its plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
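For reference, a one-function NumPy sketch of the normalized initialization above; `n_in` and `n_out` correspond to $n_j$ and $n_{j+1}$, and the function name is my own.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    # Normalized initialization from the paper:
    # W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))
```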

[link]
[code](https://github.com/openai/improved-gan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understanding-minibatch-discrimination-in-gans/)

### Feature matching
Problem: the generator overtrains on the current discriminator.
Solution: minimize the distance between the mean activations on real and generated data, $\left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p_{z}(z)} f(G(z)) \right\|_2^2$, where $f(x)$ are the activations of an intermediate layer in the discriminator (see the sketch at the end of this summary).

### Minibatch discrimination
Problem: the generator collapses to a single point.
Solution: for each sample $i$, concatenate to $f(x_i)$ features $o(x_i)_b$ measuring its distance to the other samples $j$ ($i$ and $j$ are both real or both generated samples in the same batch): $o(x_i)_b = \sum_j \exp(-\| M_{i,b} - M_{j,b} \|_{L_1})$. This generates visually appealing samples very quickly.

### Historical averaging
Problem: SGD fails by going into extended orbits.
Solution: add a penalty that reverts parameters toward their historical mean, $\left\| \theta - \frac{1}{t} \sum_{i=1}^t \theta[i] \right\|^2$.

### One-sided label smoothing
Problem: discriminator vulnerability to adversarial examples.
Solution: the discriminator target for positive samples is 0.9 instead of 1.

### Virtual batch normalization
Problem: using BN causes the output for each example in a batch to depend on the other examples.
Solution: use a reference batch chosen once at the start of training; each sample is normalized using both itself and the reference batch. It's expensive, so it is used only in the generator.

### Assessment of image quality
Problem: MTurk evaluation is not reliable.
Solution: use an Inception model $p(y|x)$ to compute $\exp(\mathbb{E}_x \text{KL}(p(y|x) \| p(y)))$ on 50K generated images $x$.

### Semi-supervised learning
Use the discriminator to also classify the $K$ labels when known, and use all real samples (labeled and unlabeled) in the discrimination task: $D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
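To make the feature-matching objective concrete, here is a minimal PyTorch sketch; `f` (the chosen intermediate discriminator layer), `G`, and the batch variables are placeholders, not code from the linked repository.

```python
import torch

def feature_matching_loss(f, G, real_batch, z_batch):
    # Mean activations of the intermediate discriminator layer f(.)
    # over a real batch and a generated batch.
    real_features = f(real_batch).mean(dim=0)       # approximates E_x f(x)
    fake_features = f(G(z_batch)).mean(dim=0)       # approximates E_z f(G(z))
    # Squared L2 distance between the two feature means.
    return torch.sum((real_features - fake_features) ** 2)
```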

[link]
# Object detection system overview

https://i.imgur.com/vd2YUy3.png

The system
1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN's mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. The localization problem:
   * One approach frames localization as a regression problem; it reports a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by R-CNN.
   * An alternative is to build a sliding-window detector. However, with five convolutional layers the units have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
2. Labeling the data:
   * The conventional solution is to use unsupervised pre-training, followed by supervised fine-tuning.
   * Instead: supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
   * Fine-tuning for detection improves mAP performance by 8 percentage points.
   * Stochastic gradient descent via backpropagation is effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN

This system consists of three modules:
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.

### Module design

1. Region proposals
   * Related work detects mitotic cells by applying a CNN to regularly-spaced square crops.
   * R-CNN uses the selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute).
   * Computing region proposals and features takes 13 s/image on a GPU or 53 s/image on a CPU.
2. Feature extraction
   * Extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN.
   * Features are computed by forward-propagating a mean-subtracted 227 x 227 RGB image through five convolutional layers and two fully connected layers.
   * All pixels in a tight bounding box around the region are warped to the required size.
   * The feature matrix is typically 2000 x 4096.
3. Test-time detection
   * At test time, run selective search on the test image to extract around 2000 region proposals (selective search's "fast mode" in all experiments).
   * Warp each proposal and forward-propagate it through the CNN to compute features. Then, for each class, score each extracted feature vector using the SVM trained for that class.
   * Given all scored regions in an image, apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (a sketch of this step follows the summary).

## Training

1. Supervised pre-training
   * Pre-train the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).
2. Domain-specific fine-tuning
   * Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals, with a learning rate of 0.001.
3. Object category classifiers
   * Use an intersection-over-union (IoU) overlap threshold of 0.3 to label regions: proposals below this overlap with a ground-truth box are treated as negatives.
   * Once features are extracted and training labels are applied, optimize one linear SVM per class.
   * Adopt the standard hard negative mining method to fit the large training data in memory.

### Results on PASCAL VOC 2010-12

1. VOC 2010
   * Compared against four strong baselines, including SegDPM, DPM, UVA, and Regionlets.
   * Achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.

   https://i.imgur.com/0dGX9b7.png
2. ILSVRC2013 detection
   * Ran R-CNN on the 200-class ILSVRC2013 detection dataset.
   * R-CNN achieves a mAP of 31.4%.

   https://i.imgur.com/GFbULx3.png

### Performance layer-by-layer, without fine-tuning

1. Layer pool5
   * The max-pooled output of the network's fifth and final convolutional layer.
   * The pool5 feature map is 6 x 6 x 256 = 9216-dimensional.
   * Each pool5 unit has a receptive field of 195 x 195 pixels in the original 227 x 227 pixel input.
2. Layer fc6
   * Fully connected to pool5.
   * It multiplies a 4096 x 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.
3. Layer fc7
   * Implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, similarly adding a vector of biases and applying half-wave rectification.

### Performance layer-by-layer, with fine-tuning

* The CNN's parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.

### Network architectures

* A 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max-pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as "O-Net" for OxfordNet and the baseline as "T-Net" for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: the forward pass of O-Net takes considerably longer than that of T-Net.

### The ILSVRC2013 detection dataset

* The dataset is split into three sets: train (395,918), val (20,121), and test (40,152).

### CNN features for segmentation

* Full R-CNN: the first strategy (full) ignores the region's shape and computes CNN features directly on the warped window. However, two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region's foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: the third strategy (full+fg) simply concatenates the full and fg features.

https://i.imgur.com/n1bhmKo.png
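The greedy class-wise non-maximum suppression step described under "Test-time detection" can be made concrete with a short sketch. The `[x1, y1, x2, y2]` box format and the 0.3 default are assumptions for illustration; the paper uses a learned threshold.

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union of one box against an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.3):
    # Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    # by more than the threshold, and repeat on the remainder.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep
```

In R-CNN this runs independently for each class over all scored proposals in an image.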
