[link]
[code](https://github.com/openai/improved-gan), [demo](http://infinite-chamber-35121.herokuapp.com/cifar-minibatch/1/?), [related](http://www.inference.vc/understanding-minibatch-discrimination-in-gans/)

### Feature matching

Problem: the generator overtrains on the current discriminator.

Solution: match the statistics of an intermediate discriminator layer instead of the discriminator output, $||\mathbb{E}_{x \sim p_{\text{data}}}f(x) - \mathbb{E}_{z \sim p_{z}(z)}f(G(z))||_{2}^{2}$, where $f(x)$ denotes the activations of an intermediate layer of the discriminator.

### Minibatch discrimination

Problem: the generator collapses to a single point.

Solution: for each sample $i$, concatenate to $f(x_i)$ features $o(x_i)_b$ measuring its distance to the other samples $j$ in the same batch ($i$ and $j$ are either both real or both generated): $o(x_i)_b = \sum_j \exp(-||M_{i, b} - M_{j, b}||_{L_1})$, where $M_i$ is obtained by multiplying $f(x_i)$ by a learned tensor $T$. This produces visually appealing samples very quickly (a short code sketch of these terms follows at the end of this note).

### Historical averaging

Problem: SGD fails by going into extended orbits.

Solution: add a term that pulls the parameters back toward their historical mean, $||\theta - \frac{1}{t} \sum_{i=1}^t \theta[i]||^2$.

### One-sided label smoothing

Problem: the discriminator is vulnerable to adversarial examples.

Solution: the discriminator target for positive (real) samples is 0.9 instead of 1.

### Virtual batch normalization

Problem: with standard BN, the output for each example depends on the other examples in the batch.

Solution: use a reference batch chosen once at the start of training, and normalize each sample using statistics computed from itself together with the reference batch. This is expensive, so it is only used in the generator.

### Assessment of image quality

Problem: MTurk evaluation is not reliable.

Solution: use the Inception model's $p(y|x)$ to compute $\exp(\mathbb{E}_x \text{KL}(p(y | x) \,||\, p(y)))$ on 50K generated images $x$.

### Semi-supervised learning

Use the discriminator to also classify over the $K$ labels when they are known, and use all real samples (labeled and unlabeled) in the discrimination task: $D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case, use feature matching but not minibatch discrimination. This also improves the quality of generated images.
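To make a few of these terms concrete, here is a minimal PyTorch-style sketch (my own illustration, not the authors' code; the function names and the shapes assumed for `f` and `T` are mine) of the feature-matching loss, the minibatch-discrimination features, and the Inception score:

```python
import torch

def feature_matching_loss(f_real, f_fake):
    # || E_x f(x) - E_z f(G(z)) ||_2^2, estimated on the current minibatch,
    # where f(.) are intermediate discriminator activations.
    return (f_real.mean(dim=0) - f_fake.mean(dim=0)).pow(2).sum()

def minibatch_discrimination(f, T):
    # f: (N, A) intermediate features; T: (A, B, C) learned tensor.
    N = f.size(0)
    M = (f @ T.view(T.size(0), -1)).view(N, T.size(1), T.size(2))   # (N, B, C)
    l1 = (M.unsqueeze(0) - M.unsqueeze(1)).abs().sum(dim=3)         # (N, N, B)
    o = torch.exp(-l1).sum(dim=1)          # o(x_i)_b = sum_j exp(-||M_ib - M_jb||_1)
    return torch.cat([f, o], dim=1)        # concatenate o(x_i) back onto f(x_i)

def inception_score(p_yx, eps=1e-12):
    # p_yx: (N, K) softmax outputs p(y|x) of the Inception model on N generated images.
    p_y = p_yx.mean(dim=0, keepdim=True)   # marginal p(y)
    kl = (p_yx * (torch.log(p_yx + eps) - torch.log(p_y + eps))).sum(dim=1)
    return kl.mean().exp()                 # exp(E_x KL(p(y|x) || p(y)))
```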
[link]
`Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.`

This paper considers the problem of Maximum Inner Product Search (MIPS). In MIPS, given a query $q$ and a set of inputs $x_i$, we want to find the input (or the top n inputs) with the highest inner product, i.e. $\operatorname{argmax}_i q^\top x_i$. Recently, it was shown that a simple transformation of the query and input vectors makes it possible to approximately solve MIPS using hashing methods for Maximum Cosine Similarity Search (MCSS), a problem for which solutions are readily available (see section 2.4 for a brief but very clear description of the transformation).

In this paper, the authors combine this approach with clustering, in order to improve the quality of the retrieved inputs. Specifically, they consider the spherical k-means algorithm, a variant of k-means in which data points are clustered based on cosine similarity instead of Euclidean distance (in short, data points are first scaled to unit norm; then, in the training inner loop, points are assigned to the cluster centroid with the highest dot product, and cluster centroids are updated as usual except that they are always rescaled to unit norm). Moreover, they consider a bottom-up application of the algorithm to yield a hierarchical clustering tree, and propose to use such a tree to find the top-n candidates for MIPS.

The key insight is that, since spherical k-means relies on cosine similarity for finding the best cluster, and since we have a transformation that allows the maximisation of inner product to be approximated by the maximisation of cosine similarity, a tree for finding MIPS candidates can be constructed by running spherical k-means on the inputs transformed by the same transformation used for hashing-based MIPS. In order to make the search more robust to border issues when a query is close to the frontier between clusters, at each level of the tree they consider more than one candidate cluster during the top-down search, and merge the candidates found in the several visited leaves at the very end of a full top-down query.

Their experiments on search with word embeddings show that the quality of the top 1, 10 and 100 MIPS candidates using their spherical k-means approach is better than with two hashing-based search methods.
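As a rough illustration of the two ingredients, here is a NumPy sketch of my own (not the authors' code): the simple append-one-coordinate MIPS-to-MCSS transformation of the kind referred to above, and one round of flat spherical k-means. Function names and defaults are assumptions.

```python
import numpy as np

def transform_inputs(X):
    # Scale by the largest input norm and append one coordinate so that every
    # transformed input lies on the unit sphere.
    phi = np.linalg.norm(X, axis=1).max()
    Xs = X / phi
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(Xs ** 2, axis=1, keepdims=True)))
    return np.hstack([Xs, extra])

def transform_query(q):
    # Unit-normalize the query and append a zero: cosine similarity in the
    # transformed space then ranks items exactly like the inner product q.x.
    return np.append(q / np.linalg.norm(q), 0.0)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    # X is assumed row-normalized; centroids are re-projected onto the unit sphere.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmax(X @ C.T, axis=1)      # assign by largest dot product
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)     # rescale centroid to unit norm
    return C, assign
```

Applying `spherical_kmeans` recursively inside each cluster of the transformed inputs is one way to build the kind of hierarchical clustering tree used here for candidate retrieval.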
[link]
This paper presents an approach to visual question answering by dynamically composing networks of independent neural modules based on the semantic parsing of the question.

Main contributions:

- Independent neural modules that can be combined together and jointly trained:
    - Attention: convolutional layer, with different filters for different instances, e.g. attend[dog], attend[cat].
    - Re-attention: FC-ReLU-FC-ReLU, with different weights for different instances, e.g. re-attend[above], re-attend[not].
    - Combination: stacks two attention maps, followed by conv-ReLU to map to a single attention map, e.g. combine[and], combine[except].
    - Classification: combines an attention map and the image, followed by FC-Softmax to map to an answer, e.g. classify[colors].
    - Measurement: FC-ReLU-FC-Softmax, takes an attention map as input, e.g. measure[exists].
- Structured representations are extracted from questions and these are then mapped to network layouts, including the connections between modules.
- All leaves become attend modules, all internal nodes become re-attend or combine modules depending on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.
- Networks with the same structure but different instantiations can be processed in the same batch, e.g. `classify[color](attend[cat])` and `classify[where](attend[truck])`.
- Predictions from the module network are combined with LSTM representations to get the final answer.
- Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network.
- Semantic regularities: 'green' is an implausible answer to 'what color is the bear?'.
- Experiments are performed on the synthetic SHAPES dataset and the VQA dataset.
- Performance on the SHAPES dataset is better, as it is designed to benefit from compositionality.

A toy sketch of the composition idea is given after this note.

## Strengths

- This model takes advantage of the inherently compositional nature of language, which makes a lot of sense. VQA is an extremely complex task, and breaking it up into separate functions/modules is an excellent approach.

## Weaknesses / Notes

- The mapping from syntactic structure to module network is hand-designed. Ideally, the model should learn this too, to generalize.
- Due to its compositional nature, this kind of model can possibly be used in the zero-shot learning setting, i.e. generalize to novel question types that the network hasn't seen before.
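As promised above, here is a toy PyTorch sketch of my own of dynamic module composition. The module parameterizations and dimensions are made up and much simpler than the paper's actual modules; it is only meant to show how a question-specific layout is assembled from reusable module instances.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attend(nn.Module):
    """Image features -> attention map; one instance per word, e.g. attend[cat]."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv2d(d, 1, kernel_size=1)
    def forward(self, feats):
        return torch.sigmoid(self.conv(feats))

class Combine(nn.Module):
    """Two attention maps -> one attention map, e.g. combine[and]."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)
    def forward(self, a1, a2):
        return F.relu(self.conv(torch.cat([a1, a2], dim=1)))

class Classify(nn.Module):
    """Attention map + image features -> answer logits, e.g. classify[color]."""
    def __init__(self, d, n_answers):
        super().__init__()
        self.fc = nn.Linear(d, n_answers)
    def forward(self, feats, attn):
        pooled = (feats * attn).sum(dim=(2, 3)) / attn.sum(dim=(2, 3)).clamp(min=1e-6)
        return self.fc(pooled)

# "what color is the cat?"  ->  classify[color](attend[cat])
feats = torch.randn(1, 512, 14, 14)                # CNN feature map for one image
attend_cat, classify_color = Attend(512), Classify(512, n_answers=1000)
logits = classify_color(feats, attend_cat(feats))  # layout assembled per question
```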
[link]
Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers, so that the network only has to learn the **residuals**. Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information flow in the forward pass is no longer a problem
* No vanishing / exploding gradients
* Identities don't have parameters to be learned

A minimal sketch of a residual block is given after this note.

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9, mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
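The sketch below shows the basic two-layer residual block in PyTorch for the stride-1, equal-width case (projection shortcuts and downsampling are omitted; this is an illustration of the idea, not the authors' exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x (identity skip)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity skip: the stacked layers only have to learn the residual,
        # and learning an overall identity mapping reduces to driving them to 0.
        return F.relu(out + x)
```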
[link]
If you were to survey researchers and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I'd bet good money that Batch Normalization would be somewhere on everyone's lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift.

To understand this, imagine the smaller problem of a one-layer model that's trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that mapping would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a "butterfly effect" style outcome, where the activation distributions of higher layers change more dramatically.

Batch Normalization - which takes each feature "channel" a network learns, and normalizes it [normalize = subtract the mean, divide by the standard deviation] by the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to. On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network.

However, it does have its difficulties and downsides. One salient one comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated off of that batch are noisy and high variance (for the general reason that statistics calculated off of small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm, which doesn't run into this problem of small sample sizes, is Layer Normalization. This operates under the assumption that the activations of all feature "channels" within a given layer hopefully have roughly similar distributions, and, so, you can normalize all of them by taking the aggregate mean and variance over all channels, *for a given observation*, and use those as the statistics you normalize by. Because there are typically many channels in a given layer, this means that you have many "samples" that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one.
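To make the difference concrete, here is a minimal NumPy sketch (my own, not from the paper) showing only which axes the two methods average over for an (N, C, H, W) batch of conv features; the learned scale and shift parameters are omitted:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics shared across the batch and spatial dims, kept separate per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Statistics computed per example, pooled over all channels and spatial dims.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```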
A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance. Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different than them.

Group Norm tries to find a balance point between these two approaches, one that uses multiple channels and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image; these multiple values all corresponded to a larger shared "group" feature. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation part of the paper strongly implied that the reason group norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there's no actual clustering or similarity analysis of channels that is done to place certain channels into certain groups; it's just done semi-randomly based on the index location within the feature channel vector. So, under this implementation, it seems like the benefits of group norm are less because of any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.

The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you're training on very dense data (e.g. high res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that's a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
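Continuing the same NumPy convention as the sketch above, Group Norm splits the C channels of each example into contiguous index groups and normalizes within each group (in this sketch C must be divisible by the number of groups):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # Split the C channels of each example into groups and normalize within each group.
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(N, C, H, W)
```

With `num_groups=1` this reduces to the Layer Norm sketch above, and with one channel per group it normalizes each channel on its own, so the group count is the knob that trades the bias of pooling dissimilar channels against the noisiness of computing statistics from too few values.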