Welcome to ShortScience.org!
[link]
If you were to survey researchers and ask them to name the five most broadly influential ideas in Machine Learning from the last five years, I'd bet good money that Batch Normalization would be somewhere on everyone's list. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift. To understand this, imagine the smaller problem of a one-layer model that's trying to classify based on a set of input features. Now imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training that would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a "butterfly effect" style outcome, where the activation distributions of higher layers change more dramatically.

Batch Normalization - which takes each feature "channel" a network learns and normalizes it [normalize = subtract the mean, divide by the standard deviation] using the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to. On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. One salient one comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated from that batch are noisy and high variance (for the general reason that statistics calculated from small samples are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm that doesn't run into this problem of small sample sizes is Layer Normalization. This operates under the assumption that the activations of all feature "channels" within a given layer have roughly similar distributions, so you can normalize all of them by taking the aggregate mean and variance over all channels, *for a given observation*, and using those as the statistics you normalize by. Because there are typically many channels in a given layer, this means that you have many "samples" that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one.
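To make the difference concrete, here is a minimal numpy sketch (not from the paper; the shapes are made up for illustration and the learned scale/shift parameters are omitted) showing which axes each method pools its statistics over for an (N, C, H, W) activation tensor:

```python
import numpy as np

# Toy activations: a batch of N images, C channels, HxW spatial grid.
N, C, H, W = 4, 8, 16, 16
x = np.random.randn(N, C, H, W)
eps = 1e-5

# Batch Norm: statistics per channel, pooled over the batch and spatial dims.
# With small N these estimates get noisy, which is the failure mode above.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
bn_var = x.var(axis=(0, 2, 3), keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# Layer Norm: statistics per observation, pooled over all channels and spatial dims.
# No dependence on batch size, but assumes channels share a distribution.
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (N, 1, 1, 1)
ln_var = x.var(axis=(1, 2, 3), keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)
```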
A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance. Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different from them. Group Norm tries to find a balance point between these two approaches: it uses multiple channels and normalizes within a given instance (to avoid the problems of small batch sizes), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old-school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image, so that multiple values all corresponded to a larger shared "group" feature. If a group of features all represent a similar idea, then their distributions are more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation section strongly implied that the reason Group Norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there's no actual clustering or similarity analysis done to place certain channels into certain groups; it's just done based on index location within the feature channel vector. So, under this implementation, it seems like the benefits of Group Norm are less because of any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.

The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you're training on very dense data (e.g. high-res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that's a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
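And a matching sketch of the Group Norm computation described above (again just an illustration, with the number of groups `G` chosen arbitrarily and the learned affine parameters left out):

```python
import numpy as np

# Same toy activation shape as before: (N, C, H, W).
N, C, H, W = 4, 8, 16, 16
x = np.random.randn(N, C, H, W)
eps = 1e-5

# Group Norm: split the C channels into G groups and pool statistics within
# each (observation, group), over that group's channels and spatial positions.
G = 4                                   # number of groups (a hyperparameter)
x_g = x.reshape(N, G, C // G, H, W)     # channels grouped simply by index order
gn_mean = x_g.mean(axis=(2, 3, 4), keepdims=True)
gn_var = x_g.var(axis=(2, 3, 4), keepdims=True)
x_gn = ((x_g - gn_mean) / np.sqrt(gn_var + eps)).reshape(N, C, H, W)

# G = 1 pools over all channels (Layer Norm's per-observation statistics);
# G = C normalizes each channel on its own (Instance Norm).
```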
[link]
In this paper, the authors develop a system for automatic as well as interactive annotation (i.e. segmentation) of a dataset. In the automatic mode, bounding boxes are generated by another network (e.g. Faster R-CNN), while in the interactive mode, the input bounding box around an object of interest comes from the human in the loop. The system is composed of the following parts:

https://github.com/davidjesusacu/polyrnn-pp/raw/master/readme/model.png

1. **Residual encoder with skip connections**. This step acts as a feature extractor. A ResNet-50 with a few modifications (i.e. reduced stride, use of dilation, removal of the average pooling and FC layers) serves as the base CNN encoder. Instead of utilizing only the last features of the network, the authors concatenate outputs from different layers - resized to the highest feature resolution - to capture multi-level representations. This is shown in the figure below:

https://www.groundai.com/media/arxiv_projects/226090/x4.png.750x0_q75_crop.png

2. **Recurrent decoder** is a two-layer ConvLSTM which takes the image features and the previous (or first) vertex position, and outputs a one-hot 28x28 encoding of possible vertex positions, plus one extra entry indicating that the polygon is closed (i.e. the end of the sequence). Attention weights per location are computed using the CNN features and the 1st and 2nd layers of the ConvLSTM. Training is formulated as reinforcement learning, with the recurrent decoder treated as a sequential decision-making agent. The reward function is the IoU between the mask generated by the enclosed polygon and the ground-truth mask.

3. **Evaluator network** chooses the best polygon among multiple candidates. The CNN features, the last state tensor of the ConvLSTM, and the predicted polygon are used as input, and the output is the predicted IoU. The best polygon is selected from the polygons generated from the 5 top-scoring first-vertex predictions.

https://i.imgur.com/84amd98.png

4. **Upscaling with a Graph Neural Network** takes the list of vertices generated by the ConvLSTM decoder, adds a node in between every two consecutive nodes (to produce finer details at higher resolution), and aims to predict the relative offset of each node at the higher resolution. Specifically, it extracts features around every predicted vertex and forwards them through a GGNN (Gated Graph Neural Network) to obtain the final location (i.e. offset) of each vertex (treated as a classification task).

https://www.groundai.com/media/arxiv_projects/226090/x5.png.750x0_q75_crop.png

The whole system is not trained end-to-end. While the network was trained on the CityScapes dataset, it has shown reasonable generalization to different modalities (e.g. medical data). It would be interesting to observe generalization in the opposite direction as well, i.e. train on medical data and see how the model performs on CityScapes.
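Since the reinforcement learning reward in part 2 is simply the mask IoU, here is a minimal sketch of that computation (my own illustration, not code from the paper):

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two boolean masks, used as the reward for the decoder."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Toy example: in practice the predicted polygon would first be rasterized
# into pred_mask (e.g. with skimage.draw.polygon) before computing the reward.
pred_mask = np.zeros((28, 28), dtype=bool); pred_mask[5:20, 5:20] = True
gt_mask = np.zeros((28, 28), dtype=bool);   gt_mask[7:22, 7:22] = True
print(mask_iou(pred_mask, gt_mask))
```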
[link]
The goal of this work is to perform transfer learning among numerous tasks and to discover visual relationships among them. Specifically, while we might intuitively guess that the depth of an image and its surface normals are related, this work takes a step forward and discovers beneficial relationships among 26 tasks in terms of task transferability - many of them not obvious. This is important for scenarios where an insufficient annotation budget is available for the target task: the representation learned from a 'cheaper' task can be used, along with a small dataset for the target task, to reach performance on par with fully supervised training on a large dataset. The basis of the approach is to compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily used for another task. This approach does not impose human intuition about the task relationships and chooses task transferability based on the quality of a transfer operation in a fully computational manner. The task taxonomy (i.e. **taskonomy**) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. It is computed using the four-step process below:

- In stage I (**Task-specific Modelling**), a task-specific network is trained in a fully supervised manner. The network is composed of an encoder (modified ResNet-50) and a fully convolutional decoder for pixel-to-pixel tasks, or 2-3 FC layers for low-dimensional tasks. The dataset consists of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task.
- In stage II (**Transfer Modeling**), all feasible transfers between sources and targets are trained (transfers from multiple input tasks to a single target are also considered). Specifically, after the task-specific networks are trained in stage I, the weights of the encoder are frozen (the network is used only to extract representations), and the representation from the encoder is used to train a small readout network (similar to a decoder from stage I) with a new task as a target (i.e. ground truth is available). In total, about 3000 transfer possibilities are trained (a minimal sketch of this frozen-encoder/readout setup is shown after this list).
- In stage III (**Taxonomy Solver**), the task affinities acquired from the transfer functions' performance are normalized. This is needed because different tasks lie in different output spaces and their transfer functions have different loss scales. The normalization is ordinal, using the Analytical Hierarchy Process (details are in Section 3.3 of the paper). This results in an affinity matrix in which the complete graph of relationships is fully normalized, quantifying pairwise task dependencies in terms of transfer performance.
- In stage IV (**Computed Taxonomy**), a hypergraph is synthesized that can predict the performance of any transfer policy and optimize for the optimal one. This is solved with a Binary Integer Program, as a subgraph selection problem where tasks are nodes and transfers are edges. After the optimization process, the solution devises a connectivity that solves all target tasks and maximizes their collective performance, while using only the available source tasks under user-specified constraints (e.g. a supervision budget).
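Here is a minimal PyTorch sketch of the stage II idea, assuming a hypothetical frozen stage-I encoder and a small readout head for a new pixel-to-pixel target (the module shapes and names here are placeholders, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # stand-in for the modified ResNet-50 encoder
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)
for p in encoder.parameters():                # stage II: freeze the source encoder
    p.requires_grad = False
encoder.eval()

readout = nn.Sequential(                      # small transfer/readout network
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),                      # e.g. a 1-channel pixel-to-pixel target
)
opt = torch.optim.Adam(readout.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

images = torch.randn(8, 3, 64, 64)            # toy batch
targets = torch.randn(8, 1, 64, 64)           # toy ground truth for the target task

with torch.no_grad():                         # representations only, no encoder gradients
    feats = encoder(images)
loss = loss_fn(readout(feats), targets)
opt.zero_grad()
loss.backward()
opt.step()
```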
So, if you want to train your network on an unseen task, you can obtain pretrained weights for the existing tasks from the [project page](https://github.com/StanfordVL/taskonomy/tree/master/taskbank), train readout functions against each task (as well as combinations of multiple inputs), build an affinity matrix to see where your task is positioned relative to the others, and, through the subgraph selection procedure, observe which tasks have a favourable influence on your task. Consequently, you can train your task with much less data by utilizing representations from the existing tasks which share visual significance with your task. Magnificent!
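As a toy illustration of that last selection step only (the intuition, not the paper's actual pipeline, which uses AHP ordinal normalization and a Binary Integer Program; the task names and scores below are made up):

```python
import numpy as np

sources = ["autoencoding", "surface_normals", "depth", "edges"]   # hypothetical source tasks
# Normalized affinity of each source task to the new target task (made-up numbers).
affinity_to_target = np.array([0.21, 0.74, 0.69, 0.35])

order = np.argsort(affinity_to_target)[::-1]     # rank sources by transfer quality
budget = 2                                       # e.g. only two source encoders affordable
chosen = [sources[i] for i in order[:budget]]
print(chosen)                                    # ['surface_normals', 'depth']
```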
[link]
A finding first publicized by Geoff Hinton is that, when you train a simple, lower-capacity model on the probability outputs of another model, you can often get a model with comparable performance, despite the lowered capacity. Another, even more interesting finding is that, if you take a trained model and train a model with identical structure on its probability outputs, you can often get a model with better performance than the original teacher, with quicker convergence. This paper addresses, and tries to specifically test, a few theories about why this effect might be observed.

One idea is that the "student" model can learn more quickly because getting to see the full probability distribution over a well-trained model's outputs gives it a more valuable signal, specifically because the trained model is able to better rank the classes that aren't the true class. For example, if you're training on ImageNet on an image of a husky, you're only told "this is a husky (1), and not any of the other classes, which are all 0", whereas a trained model might say "this is most likely a husky, but the probability of wolf is way higher than that of teapot". This inherently gives you more useful signal to train on, because you're given a full distribution over the classes that an image is most like. This theory goes by the name of the "Dark Knowledge" theory (a truly delightful name), because it pulls all of this knowledge that is hidden in a 0/1 label into the light. An alternative explanation for the strong performance of distillation techniques is that the student model is just benefitting from the implicit importance weighting of having a stronger gradient on examples where the teacher model is more confident. You could think of this as leading the student towards examples that are the most clear or unambiguous examples of a class, rather than more fuzzy and uncertain ones.

Along with a few other tests (which I won't address here, for the sake of time and focus), the authors design a few experiments to test these possible mechanisms of action. The first test involved doing an explicit importance weighting of examples according to how confident the teacher model is, while including no information about the incorrect classes. The second was similar, but instead involved permuting the probabilities of the classes that aren't the max-probability class. In this situation, the student model gets some information in terms of the overall magnitudes of the non-max classes, but can't leverage it as usefully, because it has been randomized. In both situations, they found that there still was some value - in other words, the student outperformed the teacher - but it outperformed by less than in the case where the student could see the teacher's full probability distribution. This supports the case that both the inclusion of probabilities for the less probable classes, and the "confidence weighting" effect of pushing the student to learn more from examples on which the "teacher" model was more confident, contribute to distillation's success.
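For reference, here is a minimal sketch of the standard Hinton-style soft-target distillation loss that this whole line of work builds on (a generic formulation, not necessarily the exact loss used in this paper; the temperature `T` and mixing weight `alpha` are conventional hyperparameters):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: match the teacher's full class distribution,
    optionally mixed with the usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)        # teacher's "dark knowledge"
    log_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)               # plain 0/1 label term
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits for a batch of 4 examples and 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```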
[link]
Proposes a two-stage approach for continual learning: an active learning phase and a consolidation phase. The active learning phase optimizes for a specific task, which is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phase uses a separate network from the knowledge base, but it is not always trained from scratch - the authors suggest a heuristic based on task similarity. The paper also improves EWC by deriving a new online variant, so that the number of stored penalty parameters does not grow linearly with the number of tasks (a sketch of the EWC-style penalty is given after the lists below).

Desiderata for a continual learning solution:

- A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks.
- It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance.
- It should be scalable, that is, the method should be trainable on a large number of tasks.
- It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant.
- Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries.

Experiments:

- Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset.
- Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) ("Space Invaders", "Krull", "Beamrider", "Hero", "Stargunner" and "Ms. Pac-man").
- 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017).
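A minimal sketch of the EWC-style quadratic penalty the consolidation phase builds on (this is the generic Kirkpatrick et al. formulation, not the paper's exact online variant; the diagonal Fisher values and anchor parameters are assumed to have been estimated after the previous task, and are faked here):

```python
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module, anchor_params: dict, fisher_diag: dict, lam: float = 1.0):
    """Quadratic penalty pulling parameters towards their values after the
    previous task, weighted by the (diagonal) Fisher information."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in anchor_params:
            penalty = penalty + (fisher_diag[name] * (p - anchor_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Toy usage: a small knowledge-base network with placeholder Fisher/anchor statistics.
kb = nn.Linear(4, 2)
anchor = {n: p.detach().clone() for n, p in kb.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in kb.named_parameters()}   # placeholder values

task_loss = kb(torch.randn(8, 4)).pow(2).mean()                      # stand-in task loss
total_loss = task_loss + ewc_penalty(kb, anchor, fisher, lam=10.0)
total_loss.backward()
```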