Welcome to ShortScience.org! |
[link]
## Terms * Semantic Segmentation: Traditional segmentation divides the image in visually similar patches. Semantic segmentation on the other hand divides the image in semantically meaningful patches. This usually means to classify each pixel (e.g.: This pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background). ## Main ideas * Complete neural networks which were trained for image classification can be used as a convolution. Those networks can be trained on Image Net (e.g. VGG, AlexNet, GoogLeNet) * Use upsampling to (1) reduce training and prediction time (2) improve consistency of output. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.) ## How FCNs work 1. Train a neural network for image classification which is trained on input images of a fixed size ($d \times w \times h$) 2. Interpret the network as a single convolutional filter for each output neuron (so $k$ output neurons means you have $k$ filters) over the complete image area on which the original network was trained. 3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$ 4. If $s > 1$, then you need an upsampling layer (deconvolutional layer) to convert the coarse output into a dense output. ## Nice properties * FCNs take images of arbitrary size and produce an image of the same output size. * Computationally efficient ## See also: https://www.quora.com/What-are-the-benefits-of-converting-a-fully-connected-layer-in-a-deep-neural-network-to-an-equivalent-convolutional-layer > They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output. > > Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to to a set of 10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with receptive field size 35 x 35 and stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 3 x 3 x 10 spatially dense output.
1 Comments
|
[link]
So the hypervector is just a big vector created from a network: `"We concatenate features from some or all of the feature maps in the network into one long vector for every location which we call the hypercolumn at that location. As an example, using pool2 (256 channels), conv4 (384 channels) and fc7 (4096 channels) from the architecture of [28] would lead to a 4736 dimensional vector."` So how exactly do we construct the vector? ![](https://i.imgur.com/hDvHRwT.png) Each activation map results in a single element of the resulting hypervector. The corresponding pixel location in each activation map is used as if the activation maps were all scaled to the size of the original image. The paper shows the below formula for the calculation. Here $\mathbf{f}_i$ is the value of the pixel in the scaled space and each $\mathbf{F}_{k}$ are points in the activation map. $\alpha_{ik}$ scales the known values to produce the midway points. $$\mathbf{f}_i = \sum_k \alpha_{ik} \mathbf{F}_{k}$$ Then the fully connected layers are simply appended to complete the vector. So this gives us a representation for each pixel but is it a good one? The later layers will have the input pixel in their receptive field. After the first few layers it is expected that the spatial constraint is not strong. |
[link]
#### Introduction * The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent. * [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/) #### Dataset * Created using large-scale real-world sources - OMDB (Open Movie Database), MovieLens and Reddit. * Consists of ~75K movie entities and ~3.5M training examples. #### Tasks ##### QA Task * Answering Factoid Questions without relation to the previous dialogue. * KB(Knowledge Base) created using OMDB and stored as triplets of the form (Entity, Relation, Entity). * Question (in Natural Language Form) generated by creating templates using [SimpleQuestions](https://arxiv.org/abs/1506.02075) * Instead of giving out just 1 response, the system ranks all the answers in order of their relevance. ##### Recommendation Task * Providing personalised responses to the user via recommendation instead of providing universal facts as in case 1. * MovieLens dataset with a *user x item* matrix of ratings. * Statements (for any user) are generated by sampling highly ranked movies by the user and forming a statement about these movies using natural language templates. * Like the previous case, a list of ranked responses is generated. ##### QA + Recommendation Task * Maintaining short dialogues involving both factoid and personalised content. * Dataset consists of short conversations of 3 exchanges (3 from each participant). ##### Reddit Discussion Task * Identify most likely response is discussions on Reddit. * Data processed to flatten the potential conversation so that it appears to be a two participant conversation. ##### Joint Task * Combines all the previous tasks into one single task to test all the skills at once. #### Models Tested * **Memory Networks** - Comprises of a memory component that includes both long term memory and short term context. * **Supervised Embedding Models** - Sum the word embeddings of the input and the target independently and compare them with a similarity metric. * **Recurrent Language Models** - RNN, LSTM, SeqToSeq * **Question Answering Systems** - Systems that answer natural language questions by converting them into search queries over a KB. * **SVD(Singular Value Decomposition)** - Standard benchmark for recommendation. * **Information Retrieval Models** - Given a message, find the most similar message in the training dataset and report its output or find a most similar response to input directly. #### Result ##### QA Task * QA System > Memory Networks > Supervised Embeddings > LSTM ##### Recommendation Task * Supervised Embeddings > Memory Networks > LSTM > SVD ##### Task Involving Dialog History * QA + Recommendation Task and Reddit Discussion Task * Memory Networks > Supervised Embeddings > LSTM ##### Joint Task * Supervised word embeddings perform very poorly even when using a large number of dimensions (2000 dimensions). * Memory Networks perform better than embedding models as they can utilise the local context and the long-term memory. But they do not perform as well on standalone QA tasks. |
[link]
Narodytska and Kasiviswanathan propose a local search-based black.box adversarial attack against deep networks. In particular, they address the problem of k-misclassification defined as follows: Definition (k-msiclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels. To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassificaiton condition, it is returned as adversarial perturbation. While the approach is very simple, it is applicable to black-box models where gradients and or internal representations are not accessible but only the final score/probability is available. Still the approach seems to be quite inefficient, taking up to one or more seconds to generate an adversarial example. Unfortunately, the authors do not discuss qualitative results and do not give examples of multiple adversarial examples (except for the four in Figure 1). https://i.imgur.com/RAjYlaQ.png Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image. |
[link]
Authors investigated why humans play some video games better than machines. That is the case for games that do not have continuous rewards (e.g., scores). They experimented with a game -- inspired by _Montezuma's Revenge_ -- in which the player has to climb stairs, collect keys and jump over enemies. RL algorithms can only know if they succeed if they finish the game, as there is no rewards during the gameplay, so they tend to do much worse than humans in these games. To compare between humans and machines, they set up RL algorithms and recruite players from Amazon Mechanical Turk. Humans did much better than machines for the original game setup. However, authors wanted to check the impact of semantics and prior knowledge on humans performance. They set up scenarios with different levels of reduced semantic information, as shown in Figure 2. https://i.imgur.com/e0Dq1WO.png This is what the game originally looked like: https://rach0012.github.io/humanRL_website/main.gif And this is the version with lesser semantic clues: https://rach0012.github.io/humanRL_website/random2.gif You can try yourself in the [paper's website](https://rach0012.github.io/humanRL_website/). Not surprisingly, humans took much more time to complete the game in scenarios with less semantic information, indicating that humans strongly rely on prior knowledge to play video games. The authors argue that this prior knowledge should also be somehow included into RL algorithms in order to move their efficiency towards the human level. ## Additional reading [Why humans learn faster than AI—for now](https://www.technologyreview.com/s/610434/why-humans-learn-faster-than-ai-for-now/). [OpenReview submission](https://openreview.net/forum?id=Hk91SGWR-) |