[link]
The paper proposes a method for joint instance and semantic segmentation. The method is fast because it is meant to run in an embedded environment (such as a robot). While the semantic map may seem redundant given the instance map, it is not: semantic segmentation is a key step in obtaining the instance map.

# Architecture

The image is first passed through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The decoders output at a low resolution for faster processing.

Decoders:
- Semantic segmentation: coupled with the encoder, it is U-Net-like. The output is a segmentation map.
- Instance center: for each pixel, outputs the confidence that it is the center of an object.
- Embedding: for each pixel, computes a 32-dimensional embedding. This embedding must have a low distance to the embeddings of other pixels of the same instance, and a high distance to the embeddings of all other pixels.

To obtain the instance map, the segmentation map is used to mask the other 2 decoder outputs, separating the embeddings and centers of each class. Centers are thresholded at 0.7, and centers whose embedding distance to another center is lower than a set amount are discarded as duplicates. Then, for each class, a similarity matrix is computed between all pixels of that class and the centers of that class. Pixels are assigned to their closest center, and each center represents a distinct instance of the class. Finally, the segmentation and instance maps are upsampled using the SLIC algorithm.

# Loss

There is one loss per decoder head:
- Semantic segmentation: weighted cross-entropy.
- Instance center: a cross-entropy term modulated by a $\gamma$ parameter to counter the over-representation of the background relative to the target classes.
- Embedding: composed of 3 parts: an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and an l2 regularization on the embeddings. $\hat{e}$ are the embeddings, $\delta_a$ is a hyper-parameter defining "close enough", and $\delta_b$ defines "far enough" (a reconstruction of the likely form of this loss is given at the end of this summary).

The whole model is trained jointly using a weighted sum of the 3 losses.

# Experiments and results

The authors test their method on the Cityscapes dataset, which is composed of 5000 annotated images and 8 instance classes. They compare their method on both semantic segmentation and instance segmentation.

For semantic segmentation, their method is merely okay: ENet, for example, performs better on average and is much faster. On the other hand, for instance segmentation, their method is much faster than the alternatives while still performing well. It is not SOTA on accuracy, but given the real-time constraint, it is much better.

# Comments

- Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.
- If they removed the aggressive down/up sampling, I wonder if they would beat Mask R-CNN and PANet.
- I'm not sure what the point of upsampling the semantic map is, given that we already have the instance map.
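As noted in the Loss section, the embedding loss is only described in words above. Here is a hedged reconstruction of its likely form, following the standard discriminative-loss formulation that matches the description; the paper's exact loss may differ. $\mu_c$ is the mean embedding of instance $c$, $N_c$ its pixel count, $C$ the number of instances, and $[x]_+ = \max(0, x)$:

$$
L_{\text{attract}} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \big[ \| \mu_c - \hat{e}_i \| - \delta_a \big]_+^2
$$

$$
L_{\text{repel}} = \frac{1}{C(C-1)} \sum_{c_A \neq c_B} \big[ 2\delta_b - \| \mu_{c_A} - \mu_{c_B} \| \big]_+^2 \qquad
L_{\text{reg}} = \frac{1}{C} \sum_{c=1}^{C} \| \mu_c \|
$$

Embeddings already within $\delta_a$ of their instance mean are not pulled further, and instance means already more than $2\delta_b$ apart are not pushed further.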
[link]
The paper designs some basic tests to compare saliency methods. It finds that some of the most popular methods are independent of the model parameters and of the data, meaning they are effectively useless.

## Methods compared

The paper compares the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCAM and SmoothGrad. A refresher on these methods is provided in the appendix.

All these methods fit in the same framework. They require a classification model and an input (typically an image). The output of the method is an *explanation map* with the same shape as the input, where a higher value for a feature implies greater relevance to the model's decision.

## Metrics of comparison

The authors argue that visual inspection of saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSIM) and the Pearson correlation of the histogram of gradients. The authors point out that these metrics capture various notions of similarity, but that this is an active area of research and the metrics are imperfect.

## First test: model parameter randomization

A saliency method must depend on the model parameters, otherwise it cannot help us understand the model. In this test, the authors randomize the model parameters layer by layer, starting from the top (a sketch of this procedure is given at the end of this summary). Surprisingly, methods such as guided backprop and guided GradCAM are completely insensitive to the model parameters, as shown in the paper's figure for an Inception v3 trained on ImageNet.

Integrated gradients also looks dubious, as the bird is still visible with a mostly randomized model, but the quantitative metrics reveal that the difference between the two models is actually large.

## Second test: data randomization

It is well known that randomly shuffling the labels of a dataset does not prevent a neural network from reaching high accuracy on the training set, though it does prevent generalization. The model can only learn by memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should contain no clearly interpretable signal. The paper shows the result for a ConvNet trained on MNIST and on a label-shuffled MNIST.

The results are very damning for most methods. Only gradients and GradCAM differ clearly between the two models, as confirmed by the low correlations.

## Discussion

- Even though some methods do not depend on the model parameters or the data, they might still depend on the architecture of the model, which could be of some use in some contexts.
- Methods that multiply the input with the gradient are dominated by the input.
- Complex saliency methods are just fancy edge detectors.
- Only gradient, SmoothGrad and GradCAM survive the sanity checks.

# Comments

- Why are their GradCAM maps so ugly? They don't look like usual GradCAM maps at all.
- Their tests are simple enough that it's hard to defend a method that doesn't pass them.
- The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret.
- In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. For images, people usually just set the image to 0, which is not at all the absence of features. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect that estimating the baseline, as done in [this paper](https://arxiv.org/abs/1702.04595), would give a fairer evaluation of IG.

Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19)
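To make the first test concrete, here is a minimal sketch of cascading model-parameter randomization, assuming PyTorch and scipy, with plain gradients standing in as the saliency method; the harness and names are mine, not the paper's code.

```python
import copy

import torch
import torch.nn as nn
from scipy.stats import spearmanr

def plain_gradient(model, x, y):
    """Simplest saliency method: |gradient| of the class logit w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    model(x)[0, y].backward()
    return x.grad.detach().abs()

def cascading_randomization(model, x, y, saliency_fn=plain_gradient):
    """Randomize layers from top to bottom, recompute the saliency map each
    time, and report its rank correlation with the original explanation."""
    base = saliency_fn(model, x, y).flatten().numpy()
    randomized = copy.deepcopy(model)
    layers = [m for m in randomized.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    correlations = []
    for layer in reversed(layers):          # start from the output end
        nn.init.normal_(layer.weight)       # destroy the learned weights
        sal = saliency_fn(randomized, x, y).flatten().numpy()
        rho, _ = spearmanr(base, sal)
        correlations.append(rho)            # should drop fast for a sane method
    return correlations
```

A method that passes the test produces correlations that collapse as soon as the top layers are randomized; a method that fails stays close to 1 throughout.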
[link]
This paper considers "the problem of learning logical structure [...] as expressed by satisfiability problems". It is an attempt at incorporating symbolic AI into neural networks. The key contribution of the paper is the introduction of "a differentiable smoothed MAXSAT solver" that is able to learn logical relationships from examples.

The example given in the paper is Sudoku. The proposed model learns the rules of the game and how to solve the puzzles jointly, **without any prior on the rules**. The core of the system is a new layer that learns satisfiability constraints while remaining differentiable, and this layer can be embedded in a typical ConvNet (a toy sketch of the idea is given at the end of this summary).

Previous attempts to solve Sudoku with a neural network were unsuccessful: the networks were able to reach high accuracy on the training set but were completely unable to generalize to new puzzles, showing that they had not learned the underlying logic. SATNet reaches 99% test accuracy on an encoded representation of Sudoku puzzles and 63% test accuracy on images of Sudoku puzzles.

# Comments

This layer can probably be used to solve other puzzle games, but I'm not familiar enough with SAT to know what kinds of practical problems could be solved with this system. Operations research problems, maybe?

Code: https://github.com/locuslab/SATNet
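To give a flavor of what a differentiable satisfiability layer means, here is a toy smoothed MAXSAT relaxation, assuming PyTorch. This is **not** the paper's SATNet layer, which uses a low-rank semidefinite relaxation solved by coordinate descent with a custom backward pass; the sketch below only shows how clause satisfaction can be smoothed so that a clause matrix becomes learnable by gradient descent. All names are mine.

```python
import torch
import torch.nn as nn

class SoftMaxSAT(nn.Module):
    """Toy smoothed MAXSAT layer: NOT the paper's SDP-based solver."""

    def __init__(self, n_vars, n_clauses):
        super().__init__()
        # Learnable clause matrix: entry (c, v) > 0 leans toward the positive
        # literal of variable v in clause c, < 0 toward its negation.
        self.S = nn.Parameter(0.1 * torch.randn(n_clauses, n_vars))

    def forward(self, z):
        # z: (batch, n_vars) probabilities that each variable is True.
        sign = torch.sigmoid(self.S)                    # soft literal polarity
        lit = sign * z.unsqueeze(1) + (1 - sign) * (1 - z.unsqueeze(1))
        # Smooth OR: probability that at least one literal in the clause holds.
        clause_sat = 1 - torch.prod(1 - lit, dim=-1)    # (batch, n_clauses)
        return clause_sat.mean(dim=-1)                  # fraction satisfied
```

Training would push `S` so that valid solutions from the dataset score high satisfaction; the real SATNet additionally infers assignments for unknown variables rather than only scoring complete ones.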
[link]
This paper was presented at ICML 2019. Do you remember greedy layer-wise training? Are you curious what a modern take on the idea can achieve? Then this is the paper for you. And it has its own very good summary:

> We use standard convolutional and fully connected network architectures, but instead of globally back-propagating errors, each weight layer is trained by a local learning signal, that is not back-propagated down the network. The learning signal is provided by two separate single-layer sub-networks, each with their own distinct loss function. One sub-network is trained with a standard cross-entropy loss, and the other with a similarity matching loss.

If it's a bit unclear, the paper's figure helps. The cross-entropy loss is the standard classification loss. The similarity loss compares the pairwise similarity structure of the layer's output $H$ with that of the one-hot encoded labels $Y$ (a sketch is given at the end of this summary):

$$
L_{\mathrm{sim}} = \left\| S(\operatorname{NeuralNet}(H)) - S(Y) \right\|_{F}^{2}
$$

where $S$ is a cosine similarity matrix with elements

$$
s_{ij} = s_{ji} = \frac{\tilde{\mathbf{x}}_{i}^{T} \tilde{\mathbf{x}}_{j}}{\|\tilde{\mathbf{x}}_{i}\|_{2} \|\tilde{\mathbf{x}}_{j}\|_{2}}
$$

The method is used to train VGG-like models on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100, SVHN and STL-10. While it gets near-SOTA results up to CIFAR-10, it's not there yet for more complex datasets: it gets 80% accuracy on CIFAR-100, where the SOTA is 90%. Still, this is better than a standard ResNet, for example.

Why would we prefer a local loss to a global one? A big advantage is that the weights can be updated during the forward pass, which avoids storing the activations in memory.

There was another paper on a similar topic, which I didn't read: [Greedy Layerwise Learning Can Scale to ImageNet](https://arxiv.org/abs/1812.11446).

# Comments

- While this is clearly not ready to just replace standard backprop, I find this line of work very interesting, as it casts doubt on one of the assumptions behind backprop: that we need a global signal to learn complex functions.
- Though not mentioned in the paper, wouldn't a local loss naturally avoid vanishing and exploding gradients?
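A minimal sketch of the similarity matching loss, assuming PyTorch. The paper feeds $H$ through a small sub-network before computing $S$; that extra step is omitted here, and the names are mine.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of a batch."""
    x = F.normalize(x.flatten(1), dim=1)   # unit-norm feature vectors
    return x @ x.t()                       # (batch, batch) matrix S

def similarity_matching_loss(h, y_onehot):
    """Squared Frobenius distance between the two similarity structures.

    h:        (batch, ...) hidden activations of the layer being trained
    y_onehot: (batch, classes) one-hot labels, as floats
    """
    return (cosine_similarity_matrix(h) - cosine_similarity_matrix(y_onehot)).pow(2).sum()
```

Since the loss only uses the current layer's activations and the labels, it can be computed and back-propagated locally, with no gradient flowing to earlier layers.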
[link]
Natural images can be decomposed into frequencies: the higher frequencies contain small changes and details, while the lower frequencies contain the global structure. Each filter of a convolutional layer focuses on different frequencies of the image. This paper proposes a way to group filters explicitly into high- and low-frequency groups. To do so, the low-frequency group is reduced spatially by a factor of 2 in each dimension (which the authors define as an octave) before the convolution is applied. This spatial reduction, which is a pooling operation, makes sense because pooling is a low-pass filter: small details are discarded but the global structure is kept.

More concretely, the layer takes as input two groups of feature maps, one at a higher resolution than the other. The output is also two groups of feature maps, separated into high and low frequencies. Information is exchanged between the two groups by pooling or upsampling as needed, as shown in the paper's figure (a code sketch is given at the end of this summary). The proportion of high- and low-frequency feature maps is controlled by a single parameter; through testing, the authors found that keeping around 25% of the features low-frequency gives the best performance.

One important property of this layer is that it can simply be used as a drop-in replacement for a standard convolutional layer, and thus requires no other changes to the architecture. The authors test it on various ResNets, DenseNets and MobileNets. In terms of tasks, they get performance near the state of the art on [ImageNet top-1](https://paperswithcode.com/sota/image-classification-on-imagenet) and top-5.

So why use octave convolutions? Because they reduce the amount of memory and computation required by the network.

# Comments

- I would have liked to see more groups of varying frequencies. Since an octave is a spatial reduction by a factor of 2, the same could be done with reductions of 2^n for n > 1. I expect this will be addressed in future work.
- While the results are not quite SOTA, octave convolutions seem compatible with EfficientNet, and I expect combining them would improve the performance of both.
- Since each octave convolution layer outputs a multi-scale representation of the input, doesn't that mean that pooling becomes less necessary in the network? If so, octave convolutions would perform even better in a new architecture optimized for them.

Code: [Official](https://github.com/facebookresearch/OctConv), [all implementations](https://paperswithcode.com/paper/drop-an-octave-reducing-spatial-redundancy-in)
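A minimal sketch of the four information paths, assuming PyTorch; it simplifies the official implementation (no strides, no first/last-layer special cases) and the names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    """Octave convolution: convolve high- and low-frequency feature maps,
    exchanging information between the two resolutions."""

    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.25):
        super().__init__()
        lo_in, lo_out = int(alpha * in_ch), int(alpha * out_ch)
        hi_in, hi_out = in_ch - lo_in, out_ch - lo_out
        pad = kernel_size // 2
        # Four paths: high->high, high->low, low->high, low->low.
        self.hh = nn.Conv2d(hi_in, hi_out, kernel_size, padding=pad)
        self.hl = nn.Conv2d(hi_in, lo_out, kernel_size, padding=pad)
        self.lh = nn.Conv2d(lo_in, hi_out, kernel_size, padding=pad)
        self.ll = nn.Conv2d(lo_in, lo_out, kernel_size, padding=pad)

    def forward(self, x_hi, x_lo):
        # High-frequency output: same-resolution conv + upsampled low path.
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2, mode="nearest")
        # Low-frequency output: pooled high path + same-resolution conv.
        y_lo = self.hl(F.avg_pool2d(x_hi, 2)) + self.ll(x_lo)
        return y_hi, y_lo
```

In the paper, the first and last octave layers use an alpha of 0 on the input or output side respectively, so the network as a whole still consumes and produces single full-resolution tensors.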
[link]
Batch Normalization doesn't work well with small batch sizes, which are often required for memory-intensive tasks such as detection or segmentation, or for memory-intensive data such as 3D images, videos or high-resolution images. Group Normalization is a simple alternative that is independent of the batch size.

It works like BN, except that the mean and std are computed over a different set of features: a group is defined as a set of channels, and the mean and std are computed over that set of channels for one sample, as illustrated in the paper's figure. The $\gamma$ and $\beta$ are learned per channel and applied as usual (a code sketch is given at the end of this summary). By default there are 32 groups, but they show GN works well as long as there is more than one group and fewer groups than channels.

In terms of experiments, they try ImageNet classification, detection and segmentation on COCO, and video classification on Kinetics. The conclusion is that **GN gives the same performance no matter the batch size, and that this performance matches BN with large batches.** The most impressive result is a 10% accuracy increase over BN on ImageNet with a batch size of 2.

# Comments

- This paper got an honorable mention at ECCV 2018.
- I don't understand how it works at the entrance of the network, where there are only 1 or 3 channels. Are we just not supposed to put GN there?
- Also, the number of channels tends to increase through the network, but the number of groups stays fixed. Should it scale with the number of channels?
- They tested GN on many tasks, but mostly with ResNets. There was only one experiment on VGG-16, where they found no big difference from BN. For now, I'm not convinced GN is useful outside of ResNets.

Code: https://github.com/facebookresearch/Detectron/tree/master/projects/GN
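A minimal sketch of the computation, assuming PyTorch tensors of shape (N, C, H, W); PyTorch's built-in `nn.GroupNorm` implements the same operation with learnable per-channel affine parameters.

```python
import torch

def group_norm(x, num_groups=32, eps=1e-5, gamma=None, beta=None):
    """Normalize each group of channels per sample: no batch dependence."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.view(n, num_groups, c // num_groups, h, w)
    # Mean/std over the channels of one group and all spatial positions,
    # computed independently for every sample in the batch.
    mean = g.mean(dim=(2, 3, 4), keepdim=True)
    var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    g = (g - mean) / torch.sqrt(var + eps)
    out = g.view(n, c, h, w)
    if gamma is not None:  # per-channel scale and shift, as in BN
        out = out * gamma.view(1, c, 1, 1) + beta.view(1, c, 1, 1)
    return out
```

Because the statistics never cross the batch dimension, the output for a given sample is identical whether the batch size is 2 or 256, which is exactly the property the paper exploits.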