Welcome to ShortScience.org! |
[link]
# Object detection system overview. https://i.imgur.com/vd2YUy3.png 1. takes an input image, 2. extracts around 2000 bottom-up region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using class-specific linear SVMs. * R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. * On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%. ## There is a 2 challenges faced in object detection 1. localization problem 2. labeling the data 1 localization problem : * One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method. * An alternative is to build a sliding-window detector. considered adopting a sliding-window approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm. 2 labeling the data: * The conventional solution to this problem is to use unsupervised pre-training, followed by supervise fine-tuning * supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL), * fine-tuning for detection improves mAP performance by 8 percentage points. * Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs) ## Object detection with R-CNN This system consists of three modules * The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. * The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. * The third module is a set of class specific linear SVMs. Module design 1 Region proposals * which detect mitotic cells by applying a CNN to regularly-spaced square crops. * use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute). * the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) 2 Feature extraction. * extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN * Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers. * warp all pixels in a tight bounding box around it to the required size * The feature matrix is typically 2000x4096 3 Test time detection * At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). * warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. * Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over union (IoU) overlap with a higher scoring selected region larger than a learned threshold. ## Training 1 Supervised pre-training: * pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data) 2 Domain-specific fine-tuning. * use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001. 3 Object category classifiers. * use intersection-over union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3. * Once features are extracted and training labels are applied, we optimize one linear SVM per class. * adopt the standard hard negative mining method to fit large training data in memory. ### Results on PASCAL VOC 201012 1 VOC 2010 * compared against four strong baselines including SegDPM, DPM, UVA, Regionlets. * Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster https://i.imgur.com/0dGX9b7.png 2 ILSVRC2013 detection. * ran R-CNN on the 200-class ILSVRC2013 detection dataset * R-CNN achieves a mAP of 31.4% https://i.imgur.com/GFbULx3.png #### Performance layer-by-layer, without fine-tuning 1 pool5 layer * which is the max pooled output of the network’s fifth and final convolutional layer. *The pool5 feature map is 6 x6 x 256 = 9216 dimensional * each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input 2 Layer fc6 * fully connected to pool5 * it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases 3 Layer fc7 * It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification #### Performance layer-by-layer, with fine-tuning * CNN’s parameters fine-tuned on PASCAL. * fine-tuning increases mAP by 8.0 % points to 54.2% ### Network architectures * 16-layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet. * RCNN with O-Net substantially outperforms R-CNN with TNet, increasing mAP from 58.5% to 66.0% * drawback in terms of compute time, with in terms of compute time, with than T-Net. 1 The ILSVRC2013 detection dataset * dataset is split into three sets: train (395,918), val (20,121), and test (40,152) #### CNN features for segmentation. * full R-CNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap. * fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. * full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features https://i.imgur.com/n1bhmKo.png
1 Comments
|
[link]
This very new paper, is currently receiving quite a bit of attention by the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/). The paper describes a new training approach, which solves the two major practical problems with current GAN training: 1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and will help debugging, tune parameters and so on. 2) The training process does not suffer from all the instability problems. In particular the paper reduces mode collapse significantly. On top of that, the paper comes with quite a bit mathematical theory, explaining why there approach works and other approachs have failed. This paper is a must read for anyone interested in GANs. |
[link]
This paper introduces a deep universal word embedding based on using a bidirectional LM (in this case, biLSTM). First words are embedded with a CNN-based, character-level, context-free, token embedding into $x_k^{LM}$ and then each sentence is parsed using a biLSTM, maximizing the log-likelihood of a word given it's forward and backward context (much like a normal language model). The innovation is in taking the output of each layer of the LSTM ($h_{k,j}^{LM}$ being the output at layer $j$) $$ \begin{align} R_k &= \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1 \ldots L \} \\ &= \{h_{k,j}^{LM} | j = 0 \ldots L \} \end{align} $$ and allowing the user to learn a their own task-specific weighted sum of these hidden states as the embedding: $$ ELMo_k^{task} = \gamma^{task} \sum_{j=0}^L s_j^{task} h_{k,j}^{LM} $$ The authors show that this weighted sum is better than taking only the top LSTM output (as in their previous work or in CoVe) because it allows capturing syntactic information in the lower layer of the LSTM and semantic information in the higher level. Table below shows that the second layer is more useful for the semantic task of word sense disambiguation, and the first layer is more useful for the syntactic task of POS tagging. https://i.imgur.com/dKnyvAa.png On other benchmarks, they show it is also better than taking the average of the layers (which could be done by setting $\gamma = 1$) https://i.imgur.com/f78gmKu.png To add the embeddings to your supervised model, ELMo is concatenated with your context-free embeddings, $\[ x_k; ELMo_k^{task} \]$. It can also be concatenated with the output of your RNN model $\[ h_k; ELMo_k^{task} \]$ which can show improvements on the same benchmarks https://i.imgur.com/eBqLe8G.png Finally, they show that adding ELMo to a competitive but simple baseline gets SOTA (at the time) on very many NLP benchmarks https://i.imgur.com/PFUlgh3.png It's all open-source and there's a tutorial [here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) |
[link]
## General stuff about face recognition Face recognition has 4 main tasks: * **Face detection**: Given an image, draw a rectangle around every face * **Face alignment**: Transform a face to be in a canonical pose * **Face representation**: Find a representation of a face which is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes) * **Face verification**: Images of two faces are given. Decide if it is the same person or not. The face verification task is sometimes (more simply) a face classification task (given a face, decide which of a fixed set of people it is). Datasets being used are: * **LFW** (Labeled Faces in the Wild): 97.35% accuracy; 13 323 web photos of 5 749 celebrities * **YTF** (YouTube Faces): 3425 YouTube videos of 1 595 subjects * **SFC** (Social Face Classification): 4.4 million labeled faces from 4030 people, each 800 to 1200 faces * **USF** (Human-ID database): 3D scans of faces ## Ideas in this paper This paper deals with face alignment and face representation. **Face Alignment** They made an average face with the USF dataset. Then, for each new face, they apply the following procedure: * Find 6 points in a face (2 eyes, 1 nose tip, 2 corners of the lip, 1 middle point of the bottom lip) * Crop according to those * Find 67 points in the face / apply them to a normalized 3D model of a face * Transform (=align) face to a normalized position **Representation** Train a neural network on 152x152 images of faces to classify 4030 celebrities. Remove the softmax output layer and use the output of the second-last layer as the transformed representation. The network is: * C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB-channels) (returns $142\times 142$ "images") * M2 (max pooling): $3 \times 3$, stride of 2 (returns $71\times 71$ "images") * C3 (convolution): 16 filters of size $9 \times 9 \times 16$ (returns $63\times 63$ "images") * L4 (locally connected): $16\times9\times9\times16$ (returns $55\times 55$ "images") * L5 (locally connected): $16\times7\times7\times16$ (returns $25\times 25$ "images") * L6 (locally connected): $16\times5\times5\times16$ (returns $21\times 21$ "images") * F7 (fully connected): ReLU, 4096 units * F8 (fully connected): softmax layer with 4030 output neurons The training was done with: * Stochastic Gradient Descent (SGD) * Momentum of 0.9 * Performance scheduling (LR starting at 0.01, ending at 0.0001) * Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$ * ~15 epochs ($\approx$ 3 days) of training ## Evaluation results * **Quality**: * 97.35% accuracy (or mean accuracy?) with an Ensemble of DNNs for LFW * 91.4% accuracy with a single network on YTF * **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure which size). This includes image decoding, face detection and alignment, **the** feed forward network (why only one? wasn't this the best performing Ensemble?) and final classification output ## See also * Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI) |
[link]
The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper): which learns linear word embeddings that are particularly useful for analogical reasoning type tasks. The extensions proposed (namely, negative sampling and sub-sampling of high frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase level embeddings by slightly tweaking the original training algorithm. This paper proposes 3 improvements for the skip-gram model which allows for learning embeddings for words. The first improvement is subsampling frequent word, the second is the use of a simplified version of noise constrastive estimation (NCE) and finally they propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad-hoc. In practice, both the subsampling and negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings. The article does not propose any explanation as to why the negative sampling produces better results than NCE which it is suppose to loosely approximate. In fact it doesn't explain why besides the obvious generalization gain the negative sampling scheme should be preferred to NCE since they achieve similar speeds. |