![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixel-wise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)). The paper introduces RPNs (Region Proposal Networks). They are end-to-end trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid. RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score. ## See also * [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma) * [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17) ![]() |
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. ![]() |
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid, and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed. ## Strengths - Works real-time. Base model runs at 45fps and a faster version goes up to 150fps, and they claim that it's more than twice as fast as other works on real-time detection. - End-to-end model; Localization and classification errors can be jointly optimized. - YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in ~3% mAP gain (without much computational time as R-CNN is much slower). ## Weaknesses / Notes - Results fall short of state-of-the-art: 57.9% v/s 70.4% mAP (Faster R-CNN). - Performs worse at detecting small objects, as at most one object per grid cell can be detected. ![]() |
[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) ![]() |
[link]
This work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, they continue to use this basic Depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck. The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically, and naively, one could just try to use lower dimensional internal representations to map the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), and so having layers contain only the number of dimensions needed for the manifold would mean that you end up with too-few dimensions after the ReLU information loss. However, you need to have non-linearities somewhere in the network in order to be able to learn complex, non-linear functions. So, the authors suggest a method to mostly use smaller-dimensional representations internally, but still maintain ReLus and the network's needed complexity. https://i.imgur.com/pN4d9Wi.png - A lower-dimensional output is "projected up" into a higher dimensional output - A ReLu is applied on this higher-dimensional layer - That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss - A residual connection between the lower-dimensional output at the beginning and end of the expansion This way, we still maintain the network's non-linearity, but also replace some of the network's higher-dimensional layers with lower-dimensional linear ones ![]() |