[link]
The main contribution of this paper is a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer's inputs changes. This phenomenon is called internal covariate shift (ICS).

#### What is BN?

Normalize each (scalar) feature independently with respect to the mean and variance of the mini-batch, then scale and shift the normalized values with two new parameters (per activation) that are learned. BN thus makes normalization part of the model architecture.

#### What do we gain?

According to the authors, BN provides a great speed-up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model, which allows using less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows using sigmoids as activation functions without the saturation problem.

#### What follows?

This seems especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformations that scale the activations up or down in certain directions (eigenvectors). Since unrolled RNNs are effectively ultra-deep networks, this normalization seems especially useful in that context, as it should allow the gradient to flow more easily.

#### Like

* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike

* Does not work with pure stochastic gradient descent (mini-batch size = 1).
* Could reduce the parallelism of the algorithm, since all the examples in a mini-batch are now tied.
* The ImageNet results are for an ensemble of networks, which makes it harder to evaluate the relevance of BN by itself (although they do mention the performance of a single model).
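To make the transform concrete, here is a minimal numpy sketch of the per-feature mini-batch normalization with a learned scale and shift, as described above (training-time forward pass only; the epsilon value and the toy usage are illustrative, not taken from the paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization for a mini-batch x of shape (batch, features).

    Each feature is normalized with the mini-batch mean and variance,
    then scaled and shifted by the learned parameters gamma and beta.
    """
    mu = x.mean(axis=0)                 # per-feature mini-batch mean
    var = x.var(axis=0)                 # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# toy usage: 4 examples, 3 features; gamma/beta would be learned in practice
x = np.random.randn(4, 3)
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
```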
[link]
# Object detection system overview

https://i.imgur.com/vd2YUy3.png

The system:

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN's mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. the localization problem
2. labeling the data

1 Localization problem:

* One approach frames localization as a regression problem; it reports a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by this method.
* An alternative is to build a sliding-window detector. With five convolutional layers, units high in the network have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm difficult.

2 Labeling the data:

* The conventional solution to scarce labeled data is unsupervised pre-training followed by supervised fine-tuning.
* Here: supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* Fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation is effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN

The system consists of three modules:

* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.

Module design

1 Region proposals:

* R-CNN is agnostic to the particular region proposal method (prior work, for example, detects mitotic cells by applying a CNN to regularly-spaced square crops).
* It uses the selective search method in "fast mode" (Capture All Scales, Diversification, Fast to Compute).
* Computing region proposals and features takes 13 s/image on a GPU or 53 s/image on a CPU.

2 Feature extraction:

* A 4096-dimensional feature vector is extracted from each region proposal using the Caffe implementation of the CNN.
* Features are computed by forward-propagating a mean-subtracted 227 x 227 RGB image through five convolutional layers and two fully connected layers.
* Regardless of a proposal's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.
* The feature matrix per image is typically 2000 x 4096.

3 Test-time detection:

* At test time, selective search is run on the test image to extract around 2000 region proposals (selective search's "fast mode" in all experiments).
* Each proposal is warped and forward-propagated through the CNN to compute features. Then, for each class, each extracted feature vector is scored using the SVM trained for that class.
* Given all scored regions in an image, a greedy non-maximum suppression is applied (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (see the sketch just below).
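As a concrete illustration of that last test-time step, here is a hedged numpy sketch of greedy per-class non-maximum suppression with an IoU overlap test; the `[x1, y1, x2, y2]` box format and the default threshold are illustrative assumptions, not the paper's learned, per-class values:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, boxes as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring region, drop regions overlapping it above the
    threshold, and repeat; in R-CNN this is run once per class."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores))[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]    # discard heavily overlapping regions
    return keep
```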
## Training

1 Supervised pre-training:

* The CNN is pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).

2 Domain-specific fine-tuning:

* The CNN parameters are trained with stochastic gradient descent (SGD) using only warped region proposals, with a learning rate of 0.001.

3 Object category classifiers:

* Regions are labeled using an intersection-over-union (IoU) overlap threshold of 0.3 (a sketch of this labeling appears at the end of this summary).
* Once features are extracted and training labels are applied, one linear SVM is optimized per class.
* Since the training data is too large to fit in memory, the standard hard negative mining method is adopted.

### Results on PASCAL VOC 2010-12

1 VOC 2010:

* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* A large improvement in mAP, from 35.1% to 53.7%, while also being much faster. https://i.imgur.com/0dGX9b7.png

2 ILSVRC2013 detection:

* R-CNN was run on the 200-class ILSVRC2013 detection dataset.
* R-CNN achieves a mAP of 31.4%. https://i.imgur.com/GFbULx3.png

#### Performance layer-by-layer, without fine-tuning

1 pool5 layer:

* The max-pooled output of the network's fifth and final convolutional layer.
* The pool5 feature map is 6 x 6 x 256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195 x 195 pixels in the original 227 x 227 pixel input.

2 Layer fc6:

* Fully connected to pool5.
* It multiplies a 4096 x 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.

3 Layer fc7:

* Implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, similarly adding a vector of biases, and applying half-wave rectification.

#### Performance layer-by-layer, with fine-tuning

* The CNN's parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.

### Network architectures

* A 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max-pooling layers interspersed, and topped with three fully connected layers. This network is referred to as "O-Net" for OxfordNet and the baseline as "T-Net" for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: the forward pass of O-Net is considerably slower than T-Net's.

1 The ILSVRC2013 detection dataset:

* The dataset is split into three sets: train (395,918), val (20,121), and test (40,152).

#### CNN features for segmentation

* full R-CNN: the first strategy (full) ignores the region's shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region's foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: the third strategy (full+fg) simply concatenates the full and fg features. https://i.imgur.com/n1bhmKo.png
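As a rough illustration of the IoU-threshold labeling used to train the per-class SVMs, here is a minimal sketch under my reading of the scheme (ground-truth boxes as positives, proposals whose overlap with every ground-truth box falls below 0.3 as negatives, everything else ignored); the function names and box format are assumptions for illustration:

```python
import numpy as np

def pair_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_proposals_for_svm(proposals, gt_boxes, neg_iou=0.3):
    """Per-class SVM labels: ground-truth boxes become positives; proposals
    whose IoU with every ground-truth box is below the threshold become
    negatives; proposals in between are left out of SVM training."""
    negatives = [p for p in proposals
                 if all(pair_iou(p, g) < neg_iou for g in gt_boxes)]
    return list(gt_boxes), negatives

# toy usage with two proposals and one ground-truth box
positives, negatives = label_proposals_for_svm(
    proposals=[[0, 0, 10, 10], [50, 50, 70, 70]],
    gt_boxes=[[1, 1, 11, 11]])
```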
[link]
This paper deals with the question of what / how exactly CNNs learn, given the fact that they usually have more trainable parameters than data points to train on. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error no better than randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can fit any labeling (finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is totally random) into account: if $p_c = 1$, then $m$ should be 0, too, as it is impossible to learn anything meaningful from totally random labels.
* We seem to have built models which work well on image data in general, rather than specifically on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
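To make the randomization test concrete, here is a small, hedged sketch of the fit-random-labels experiment on a toy MLP; the dataset, model size and sklearn classifier are stand-ins for the paper's Inception / AlexNet / MLP setups, chosen only so the snippet runs on its own:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Randomization test in miniature: keep the inputs, destroy the signal by
# drawing labels uniformly at random, and check whether an over-parameterized
# model still drives the training error toward zero.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 100))        # 200 "images", 100 features each
random_labels = rng.integers(0, 10, 200)   # labels carry no information (p_c = 1)

mlp = MLPClassifier(hidden_layer_sizes=(512,), max_iter=2000, random_state=0)
mlp.fit(x, random_labels)

# For a sufficiently over-parameterized network this should be close to 1.0
# (the paper's point), while test accuracy would be at chance (~10%).
print("training accuracy on random labels:", mlp.score(x, random_labels))
```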
[link]
This work expands on prior techniques for designing models that can both be stored using fewer parameters and executed using fewer operations and less memory, both of which are key desiderata for making trained machine learning models usable on phones and other personal devices.

The main contribution of the original MobileNets paper was to introduce the idea of "factored" decompositions of convolutions into depthwise and pointwise steps, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct operations. In this paper, they continue to use this basic depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically and naively, one could just use lower-dimensional internal representations to match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), so having layers contain only the number of dimensions needed for the manifold would mean that you end up with too few dimensions after the ReLU information loss. Yet you need non-linearities somewhere in the network in order to learn complex, non-linear functions. So, the authors suggest a method to mostly use smaller-dimensional representations internally, while still maintaining ReLUs and the network's needed complexity.

https://i.imgur.com/pN4d9Wi.png

- A lower-dimensional input is "projected up" into a higher-dimensional representation
- A ReLU is applied on this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection links the lower-dimensional representations at the beginning and end of the expansion

This way, the network keeps its non-linearity, but also replaces some of its higher-dimensional layers with lower-dimensional linear ones.
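A minimal numpy sketch of the inverted-residual linear bottleneck described in the bullets above, viewed per pixel; the depthwise 3x3 spatial convolution of the real block is omitted for brevity, and the dimensions and names are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def inverted_residual_block(x, w_expand, w_project, use_residual=True):
    """Per-pixel view of the inverted-residual linear bottleneck.

    x:         (batch, d_low)   low-dimensional bottleneck features
    w_expand:  (d_low, d_high)  "project up" into the wide space
    w_project: (d_high, d_low)  linear projection back down to the bottleneck
    (The real block also applies a depthwise spatial conv in the wide space.)
    """
    h = relu(x @ w_expand)       # the non-linearity lives in the wide space
    out = h @ w_project          # linear bottleneck: no ReLU, so no information loss here
    return out + x if use_residual else out

# toy usage: bottleneck dimension 16, expansion factor 6
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))
y = inverted_residual_block(x,
                            w_expand=0.1 * rng.standard_normal((16, 96)),
                            w_project=0.1 * rng.standard_normal((96, 16)))
```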
[link]
This paper focuses on an effort by a DeepMind team to train an agent that can play the game Diplomacy - a complex, multiplayer game where players play as countries controlling units, trying to take over the map of Europe. Some relevant factors of this game, for the purposes of this paper, are:

1) All players move at the same time, which means you need to model your opponents' current moves and play a move that succeeds in expectation over that predicted move distribution. This also means that, in order to succeed, your policy can't be easily predictable, since, if it is, you're much easier to exploit: your opponents can more easily optimize their response to what they predict you'll do.
2) The action space is huge, even compared to something like Chess.
3) The game has significant multi-agent complexity: rather than being straightforwardly zero-sum in its reward structure, like Chess or Go, it's designed to require alliances between players in order to succeed.

Prior work - DipNet - had been able to outperform other hand-coded models through a deep network that imitated human actions, but its authors hadn't been able to get RL to successfully learn on top of that imitation baseline.

The basic approach this model takes is one that will probably feel familiar if you've read DeepMind's prior work on Chess or Go: an interplay between a fast-to-evaluate neural net component and a slower, more explicit, strategically designed component. The slower "expert" component uses the fast network component as part of its evaluation of different moves' consequences, and then, once the slow expert has generated a series of decisions, the neural net policy learns to imitate those decisions.

In this case, the slower expert tries to explicitly calculate a best response strategy, given some belief about what your opponents will do in the state you're in. Since the action space is so large, it isn't possible to calculate a full best response (that is, your best *possible* action given the actions you expect your opponents to take), so this paper instead lays out a Sampled Best Response (SBR) algorithm. It takes as input a state, as well as an opponent policy, a candidate policy, and a value network. (More on how those come to be later.) In the simplest case, the SBR algorithm works by:

1. Sampling some number (B) of actions from the opponent policy given the state. These represent a sample distribution of what moves you think your opponents are likely to play.
2. Sampling some number (C) of candidate actions from your candidate policy.
3. For each candidate action, evaluating - for each opponent action - the state you'd reach if you took the candidate action and the opponent took the opponent action, according to your value network.
4. This gives you an estimated Q value for each of your candidate actions; picking the action with the highest Q value approximates your best response action to the opponent policy distribution you pass in.

Once you have this SBR procedure, you can use it to bootstrap a better policy and value network: start with a policy and value learned from pure human imitation, and then use SBR - with your current policy as both the opponent and candidate policy - to generate a dataset of trajectories (where each action in the trajectory is chosen according to the SBR procedure). With that trajectory dataset, you can train your policy and value networks to better approximate the actions that SBR would have taken.
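A hedged Python sketch of the SBR procedure as described in the numbered steps above; the policy, value and transition callables are assumed interfaces for illustration, not the paper's actual API:

```python
import numpy as np

def sampled_best_response(state, opponent_policy, candidate_policy, value_fn,
                          transition, num_opponent_samples, num_candidates):
    """Sampled Best Response, following the steps in the summary above.

    Assumed interfaces (not from the paper):
      opponent_policy(state)  -> one sampled joint opponent action
      candidate_policy(state) -> one sampled candidate action for our agent
      transition(state, my_action, opp_action) -> resulting state
      value_fn(next_state)    -> scalar value estimate for our agent
    """
    # Step 1: sample B joint opponent actions from the opponent policy
    opponent_actions = [opponent_policy(state) for _ in range(num_opponent_samples)]
    # Step 2: sample C candidate actions from the candidate policy
    candidates = [candidate_policy(state) for _ in range(num_candidates)]

    best_action, best_q = None, -np.inf
    for action in candidates:
        # Steps 3-4: Monte Carlo estimate of Q(state, action) over sampled opponent moves
        q = np.mean([value_fn(transition(state, action, opp))
                     for opp in opponent_actions])
        if q > best_q:
            best_action, best_q = action, q
    # The highest-Q candidate approximates the best response to the opponent policy
    return best_action
```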
The basic version of this procedure is called IBR - Iterated Best Response. This is because, at each stage, SBR will return a policy that tries to perform well against the current version of the opponent policy. This kind of behavior is potentially troublesome, since it can theoretically lead to cycles, like those in Rock Paper Scissors where paper beats rock, scissors beat paper, and then rock beats scissors again. At each stage, you find the best response to what your opponent is doing right now, but you don't actually make transitive progress.

A common way of improving along the axis of this particular failure mode is to learn via "fictitious play" rather than "self play". Putting aside the name - which I don't find very usefully descriptive - this translates to simulating play against, not the current version of your opponent, but a distribution made up of past versions of the opponent. This helps prevent cycling, because choosing a strategy that only defeats the current version but would lose to a prior version is no longer a rewarding option.

The Fictitious Play-based approach this paper proposes - FPPI2 - tries to resolve this problem. Instead of sampling actions in step (1) from only your current opponent policy, it uses a sampling procedure where, for each opponent action, it first samples a point in history, and then samples the opponent policy vector at that point in history (so the multiple opponent moves collectively represent a coherent point in strategy space; a minimal sketch of this step follows at the end of this summary). Given this action profile, the final version of the algorithm continues to use the most recent/updated timestep of the policy and value network for the candidate policy and value network used in SBR, so that you're (hopefully) sampling high-quality actions and making accurate assessments of states, even while you use the distribution of (presumably worse) past policies to construct the opponent action distribution that you evaluate your candidate actions against.

https://i.imgur.com/jrQAAQW.png

The authors don't evaluate against human players, but do show that their FPPI2 approach consistently outperforms the prior DipNet state of the art and performs the best overall, though Iterated Best Response performs better than they say they expected.

Some other thoughts:

- This system is still fairly exploitable, and doesn't become meaningfully less so during training, though it does become more difficult to exploit quickly.
- This does seem like a problem where you do a lot of modeling of what you expect your opponents' strategies to be, and it seems hard to be robust, especially to collusion among your opponents.
- I was a bit disappointed that the initial framing of the paper put a lot of emphasis on the alliance-focused nature of the game, but then neither suggested mechanisms targeting that aspect of the game, nor seemed to do any specific analysis of it.
- I would have expected this game to benefit from some explicit modeling of different agents having different policies; possibly this just isn't something they could have had be the case under their evaluation scheme, which played against copies of a given policy?

Overall, my sense is that this is still a pretty early-stage checkpoint in the effort of playing Diplomacy, and that we've still got a ways to go, but it is interesting early work, and I'm curious where it leads.
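As promised above, a minimal sketch of the FPPI2-style opponent sampling that would replace step 1 of the SBR sketch earlier in this summary; `policy_checkpoints` is an assumed list of past policy snapshots (each a callable from state to a joint opponent action), not the paper's actual data structure:

```python
import random

def sample_fictitious_opponent_actions(state, policy_checkpoints, num_samples):
    """Fictitious-play sampling: for each opponent action profile, first pick a
    point in training history, then draw every opponent's move from that single
    checkpoint, so the joint action reflects one coherent (possibly older)
    strategy rather than the latest policy."""
    opponent_actions = []
    for _ in range(num_samples):
        past_policy = random.choice(policy_checkpoints)  # sample a historical checkpoint
        opponent_actions.append(past_policy(state))      # coherent joint move from it
    return opponent_actions
```

The candidate policy and value network passed to SBR would still be the latest checkpoints, as described above; only the opponent action distribution is drawn from history.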