Welcome to ShortScience.org!
[link]
## Terms

* Semantic Segmentation: Traditional segmentation divides the image into visually similar patches. Semantic segmentation, on the other hand, divides the image into semantically meaningful patches. This usually means classifying each pixel (e.g., this pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background).

## Main ideas

* Complete neural networks that were trained for image classification can be used as a single (large) convolution filter. Those networks can be trained on ImageNet (e.g. VGG, AlexNet, GoogLeNet).
* Use upsampling to (1) reduce training and prediction time and (2) improve the consistency of the output. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.)

## How FCNs work

1. Train a neural network for image classification on input images of a fixed size ($d \times w \times h$).
2. Interpret the network as a single convolutional filter for each output neuron (so $k$ output neurons means you have $k$ filters) over the complete image area on which the original network was trained.
3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$.
4. If $s > 1$, then you need an upsampling layer (deconvolutional layer) to convert the coarse output into a dense output.

## Nice properties

* FCNs take images of arbitrary size and produce an output of the same spatial size.
* They are computationally efficient.

## See also

https://www.quora.com/What-are-the-benefits-of-converting-a-fully-connected-layer-in-a-deep-neural-network-to-an-equivalent-convolutional-layer

> They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output.
>
> Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to a set of 10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with receptive field size 35 x 35 and stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 3 x 3 x 10 spatially dense output.
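To make the conversion in the quote concrete, here is a minimal sketch (PyTorch assumed; all layer names and sizes are illustrative, mirroring the 25 x 25 toy example above rather than the paper's actual networks):

```python
# Minimal sketch of turning a fully connected classifier head into an
# equivalent convolution, so the whole net can slide over larger images.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2)  # 25x25 -> 12x12
pool = nn.MaxPool2d(kernel_size=3, stride=2)      # 12x12 -> 5x5
fc = nn.Linear(16 * 5 * 5, 10)                    # head trained on 25x25 crops

# Reinterpret the fully connected layer as ten 5x5 convolution filters.
fc_as_conv = nn.Conv2d(16, 10, kernel_size=5)
fc_as_conv.weight.data = fc.weight.data.view(10, 16, 5, 5)
fc_as_conv.bias.data = fc.bias.data

# The "convolutionalized" net now accepts images larger than 25x25 and
# produces a spatial grid of class scores instead of a single vector.
x = torch.randn(1, 3, 50, 50)
dense_scores = fc_as_conv(pool(conv(x)))
print(dense_scores.shape)  # a grid of 10 class scores, e.g. (1, 10, 7, 7) here
```

The coarse grid produced this way is what step 4 above then upsamples (with a deconvolutional layer) back to the input resolution.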
[link]
So the hypercolumn is just a big vector created from a network: `"We concatenate features from some or all of the feature maps in the network into one long vector for every location which we call the hypercolumn at that location. As an example, using pool2 (256 channels), conv4 (384 channels) and fc7 (4096 channels) from the architecture of [28] would lead to a 4736 dimensional vector."`

So how exactly do we construct the vector?

![](https://i.imgur.com/hDvHRwT.png)

Each channel of each selected activation map contributes a single element of the resulting hypercolumn. The corresponding pixel location in each activation map is used as if the activation maps were all scaled to the size of the original image. The paper gives the formula below for this interpolation: $\mathbf{f}_i$ is the value of pixel $i$ in the rescaled map, each $\mathbf{F}_{k}$ is a point in the original activation map, and the $\alpha_{ik}$ are interpolation weights that blend the known values to produce the in-between points.

$$\mathbf{f}_i = \sum_k \alpha_{ik} \mathbf{F}_{k}$$

Then the fully connected layers are simply appended to complete the vector (their activations have no spatial extent, so they are shared across all locations).

So this gives us a representation for each pixel, but is it a good one? The later layers will have the input pixel in their receptive field, but after the first few layers it is expected that the spatial constraint is not strong.
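To make the construction above concrete, here is a rough sketch (PyTorch assumed; the function name `hypercolumns` and the toy shapes are mine, not the paper's) of bilinearly resizing a few feature maps and concatenating them per pixel:

```python
# Build hypercolumn vectors: every selected feature map is bilinearly resized
# to the input resolution and the per-pixel channels are concatenated.
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, out_size):
    """feature_maps: list of tensors shaped (1, C_k, H_k, W_k).
    Returns a (H*W, sum(C_k)) matrix: one hypercolumn per pixel."""
    upsampled = [
        F.interpolate(fm, size=out_size, mode="bilinear", align_corners=False)
        for fm in feature_maps
    ]
    stacked = torch.cat(upsampled, dim=1)      # (1, sum C_k, H, W)
    return stacked.flatten(2).squeeze(0).t()   # (H*W, sum C_k)

# Toy example mimicking pool2 / conv4 shapes; an fc7 vector would be treated
# as a 1x4096x1x1 map broadcast to every location before concatenation.
pool2 = torch.randn(1, 256, 28, 28)
conv4 = torch.randn(1, 384, 14, 14)
cols = hypercolumns([pool2, conv4], out_size=(224, 224))
print(cols.shape)  # torch.Size([50176, 640])
```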
[link]
Narodytska and Kasiviswanathan propose a local search-based black-box adversarial attack against deep networks. In particular, they address the problem of k-misclassification, defined as follows:

Definition (k-misclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels.

To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassification condition, it is returned as an adversarial example. While the approach is very simple, it is applicable to black-box models where gradients and/or internal representations are not accessible and only the final score/probability is available. Still, the approach seems to be quite inefficient, taking one or more seconds to generate an adversarial example. Unfortunately, the authors do not discuss qualitative results and do not give examples of multiple adversarial examples (except for the four in Figure 1).

![](https://i.imgur.com/RAjYlaQ.png)

Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image.
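For intuition, here is a heavily reduced sketch of the local-search idea (not the authors' exact algorithm or parameter settings); the model is assumed to be a black box exposing only a `predict_proba` function, and all constants are placeholders:

```python
# Greedy local search: perturb a few pixels around the previous perturbation,
# keep the change if it lowers the true-label probability, and stop once the
# true label drops out of the top k predictions.
import numpy as np

def local_search_attack(predict_proba, image, true_label, k=1,
                        rounds=150, radius=2, magnitude=0.5, seed=0):
    rng = np.random.default_rng(seed)
    adv = image.copy()
    h, w = image.shape[:2]
    cy, cx = rng.integers(h), rng.integers(w)      # current search center
    for _ in range(rounds):
        candidate = adv.copy()
        # perturb a handful of pixels inside the local search window
        for _ in range(5):
            y = np.clip(cy + rng.integers(-radius, radius + 1), 0, h - 1)
            x = np.clip(cx + rng.integers(-radius, radius + 1), 0, w - 1)
            candidate[y, x] = np.clip(
                candidate[y, x] + rng.choice([-magnitude, magnitude]), 0, 1)
        probs = predict_proba(candidate)
        if true_label not in np.argsort(probs)[-k:]:
            return candidate                       # k-misclassification reached
        if probs[true_label] < predict_proba(adv)[true_label]:
            adv, cy, cx = candidate, y, x          # move the search center
    return None                                    # attack failed
```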
[link]
Lee et al. propose a generative model for obtaining confidence-calibrated classifiers. Neural networks are known to be overconfident in their predictions – not only on examples from the task's data distribution, but also on examples taken from other distributions. The authors propose a GAN-based approach to force the classifier to predict uniform predictions on examples not taken from the data distribution. In particular, in addition to the target classifier, a generator and a discriminator are introduced. The generator generates "hard" out-of-distribution examples; ideally these examples are close to the in-distribution, i.e., the data distribution of the actual task. The discriminator is intended to distinguish between out- and in-distribution. The overall algorithm, including the necessary losses, is given in Algorithm 1. In experiments, the approach is shown to detect out-of-distribution examples nearly perfectly. Examples of the generated "hard" out-of-distribution samples are given in Figure 1.

![](https://i.imgur.com/NmF0fpN.png)

Algorithm 1: The proposed joint training scheme of the out-distribution generator $G$, the in-/out-distribution discriminator $D$ and the original classifier providing $P_\theta(y|x)$ with parameters $\theta$.

![](https://i.imgur.com/kAclSQz.png)

Figure 1: A comparison of a regular GAN (a and b) to the proposed framework (c and d). Clearly, the proposed approach generates out-of-distribution samples (i.e., no meaningful digits) close to the original data distribution.
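As a rough illustration of the classifier-side objective (a sketch, not the authors' code; the `beta` weight, names and the omitted generator/discriminator updates are assumptions), the confidence term pushes $P_\theta(y|x)$ towards the uniform distribution on generated samples:

```python
# Classifier loss: cross-entropy on in-distribution data plus a KL term that
# forces near-uniform predictions on generated out-of-distribution samples.
import torch
import torch.nn.functional as F

def classifier_loss(classifier, x_in, y_in, x_out, beta=1.0):
    logits_in = classifier(x_in)
    ce = F.cross_entropy(logits_in, y_in)

    logits_out = classifier(x_out)                 # samples from the generator
    log_p_out = F.log_softmax(logits_out, dim=1)
    num_classes = logits_out.size(1)
    # KL(Uniform || P_theta(y|x_out)) = -mean_y log P_theta(y|x_out) - log K
    kl_to_uniform = -(log_p_out.mean(dim=1)).mean() - torch.log(
        torch.tensor(float(num_classes)))
    return ce + beta * kl_to_uniform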
[link]
This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a single unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer's cost controlled by hyper-parameters).

In more detail:

* The encoding (forward propagation) adds (Gaussian) noise at all layers, while decoding is noise-free.
* The target at each layer is the result of noise-less forward propagation.
* Direct connections (also known as skip connections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus resembles a ladder (hence the name Ladder Networks).
* Miniature neural networks with a single hidden unit and skip-connections are used to combine each layer's lateral (noisy encoder) activation and the top-down signal from the layer above into a reconstruction. Each such network is applied element-wise (without parameter sharing across reconstructed units); a rough sketch follows this summary.
* The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class log-likelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noise-free forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective).
* Batch normalization is used to train the network.

Since the model combines unsupervised and supervised learning, it can be used for semi-supervised learning, where unlabeled examples update the network through the unsupervised objective only. State-of-the-art results in the semi-supervised setting are presented for both the MNIST and CIFAR-10 datasets.

#### My two cents

What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL): while recent progress in DL applied to large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in the low-labeled-data, semi-supervised setting.

Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low-data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on effective hyper-parameter search with a small validation set.

Generally, I really hope this paper will stimulate more research on DL methods for the specific case of a small labeled dataset with a large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge, which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled datasets for each and every task, on our way to real AI.
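For readers who want something more concrete, here is a rough sketch (PyTorch assumed) of a per-unit combinator with a single hidden unit and of the weighted denoising cost mentioned in the list above; the exact functional form of the combinator in the paper differs, so treat this purely as an illustration:

```python
# Illustrative Ladder-Network pieces: an element-wise combinator that mixes
# the noisy lateral activation z_tilde with the top-down signal u, and the
# layer-wise denoising cost against the clean (noise-free) activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combinator(nn.Module):
    def __init__(self, num_units):
        super().__init__()
        # one set of scalar weights per unit: a "miniature network" with a
        # single sigmoid hidden unit, applied element-wise
        self.w = nn.Parameter(torch.zeros(9, num_units))

    def forward(self, z_tilde, u):
        w = self.w
        hidden = torch.sigmoid(w[0] + w[1] * z_tilde + w[2] * u +
                               w[3] * z_tilde * u)
        return (w[4] + w[5] * z_tilde + w[6] * u + w[7] * z_tilde * u +
                w[8] * hidden)

def denoising_cost(clean_zs, reconstructed_zs, lambdas):
    # weighted sum of per-layer reconstruction errors against the clean pass;
    # this is added to the supervised cross-entropy on the noisy forward pass
    return sum(lam * F.mse_loss(z_hat, z.detach())
               for lam, z, z_hat in zip(lambdas, clean_zs, reconstructed_zs))
```

The per-layer weights `lambdas` play the role of the hyper-parameters controlling the relative importance of each layer's reconstruction, as described at the top of this summary.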