[link]
This paper deals with an important problem: making a deep classification system explainable. After the (continuing) success of deep networks, researchers are trying to open the black box, and this work is one of the foremost efforts in that direction. The authors use the strength of one deep learning method (a vision-language model) to explain the decisions of another deep learning model (an image classifier). The approach jointly predicts a class label and explains in natural language why it predicted that label.

The paper starts with an important distinction between two basic schools of *explanation* systems: the *introspection* explanation system and the *justification* explanation system. An introspection system looks into the model to get an explanation (e.g., "This is a Western Grebe because filter 2 has a high activation..."). A justification system, on the other hand, justifies the decision by producing a sentence detailing how the visual evidence is compatible with the system output (e.g., "This is a Western Grebe because it has red eyes..."). The paper focuses on *justification* explanation systems and proposes a novel one. The authors argue that, unlike a description of an image or a sentence defining a class (not necessarily in the presence of an image), a visual explanation, conditioned on an input image, explains why the image is classified as a certain category while mentioning only image-relevant features.

The broad outline of the approach is given in Fig. 2 of the paper: https://i.imgur.com/tta2qDp.png. The first stage is a deep convolutional network for classification which generates a softmax distribution over the classes. As the task is fine-grained bird species classification, it uses a compact bilinear feature representation known to work well for fine-grained classification tasks. The second stage is a stacked LSTM which generates natural language sentences, i.e., explanations justifying the decision of the first stage. The first LSTM of the stack receives the previously generated word. The second LSTM receives the output of the first LSTM along with the image features and the predicted label distribution from the classification network, and produces output words until an "end-of-sentence" token is generated. The intuition behind conditioning on the predicted label distribution is that it informs the explanation model which words and attributes are more likely to occur in the description.

Two kinds of losses are used for the second stage, i.e., the language model. The first is termed the *Relevance Loss*, the typical sentence generation loss seen in the literature: the sum of cross-entropy losses of the generated words with respect to the ground truth words. Its role is to optimize the alignment between generated and ground truth sentences. However, this loss is not very effective at producing sentences that include class-discriminative information, because class specificity is a global sentence property. This is illustrated with the following example: *whereas the sentence "This is an all black bird with a bright red eye" is class specific to a "Bronzed Cowbird", words and phrases in the sentence, such as "black" or "red eye", are less class discriminative on their own.* As a result, a cross-entropy loss on individual words is less effective at capturing global sentence properties, of which class specificity is an example.
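In symbols (my notation, not necessarily the paper's): for a ground-truth explanation $w_1, \dots, w_T$ of image $I$ with predicted class distribution $C$, the relevance loss is the usual per-word cross entropy,

$$\mathcal{L}_{rel} = -\sum_{t=1}^{T} \log p(w_t \mid w_{1:t-1}, I, C).$$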
The authors address this issue by proposing an additional loss, termed the *Discriminative Loss*, which is based on a reinforcement learning paradigm. Before computing the loss, a sentence is sampled. The sentence is passed through an LSTM-based classification network whose task is to recover the ground truth category $C$ given only the sampled sentence. The reward for this operation is simply the probability of the ground truth category $C$ given only the sentence. The intuition is that for the model to obtain a large reward, the generated sentence must include enough information to classify the original image properly. The *Discriminative Loss* is the expectation of the negative of this reward, and a weighted linear combination of the two losses is optimized during training.

My experience in reinforcement learning is limited. However, I must say I did not quite get why sampling of the sentences is required (which called for the special backpropagation algorithm). If the idea is to check whether a generated sentence can be used to recover the ground truth category, could the last internal state of one of the stacked LSTMs not be used instead? It would have been better to get some more intuition behind the sampling operation. Another thing which (is fairly obvious but still I felt) is missing is the loss used in the fine-grained classification network.

The experimentation is rigorous. The proposed method is compared with four different baseline and ablation models - description, definition, explanation-label, and explanation-discriminative - with different permutations and combinations of the two types of losses, class prediction information, etc. The evaluation metrics also measure different qualities of the generated explanations, specifically image relevance and class relevance. To measure image relevance, METEOR/CIDEr scores of the generated sentences against the ground truth (image-based) explanations are computed. To measure class relevance, CIDEr scores against class definition sentences (not necessarily based on images from the dataset) are computed. The proposed approach consistently outperforms all the baseline and ablation methods.

I would specifically mention one experiment where the effect of class conditioning is studied (end of Sec 5.2). The finding is quite interesting, as it shows that providing correct versus incorrect class information has a drastic effect on the generated explanations. Giving incorrect class information makes the explanation model hallucinate colors or attributes which are not present in the image but are specific to the (incorrect) class. This raises the question of whether it is worth providing the class information when the classifier is poor in the first place. But I think the answer lies in the observation that row 5 (with class prediction information) in Table 1 is always better than row 4 (no class prediction information). Since row 5 is better than row 4, the classifier must be reasonable, which in turn suggests that end-to-end training can improve all the stages of a pipeline and ultimately the overall performance of the system too!

In summary, the paper is a very good first step towards explaining intelligent systems and should encourage a lot more effort in this direction.
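On the sampling question above: my understanding (not spelled out in the paper) is that the sampled sentence is a discrete object, so the reward cannot be backpropagated through it directly; the REINFORCE/score-function trick instead weights the log-probability of the sampled sentence by its reward. A minimal sketch of that estimator, with every name (`generator`, `sentence_classifier`) hypothetical:

```python
import torch

# Hypothetical pieces: `generator.sample` draws an explanation token-by-token
# and returns the summed log-probability of the sampled words;
# `sentence_classifier` maps a sampled sentence to a class distribution.
def discriminative_loss(generator, sentence_classifier,
                        image_feats, label_dist, true_class):
    # Sampling a sentence is discrete, hence non-differentiable on its own.
    tokens, log_prob = generator.sample(image_feats, label_dist)
    # Reward: probability of the ground-truth class given only the sentence.
    with torch.no_grad():
        reward = sentence_classifier(tokens)[true_class]
    # Score-function estimator: grad E[R] = E[R * grad log p(sentence)],
    # so minimising -R * log_prob follows the reward gradient in expectation.
    return -reward * log_prob

# Full objective (weighted combination, weight `lambda_` is mine):
# loss = relevance_loss + lambda_ * discriminative_loss(...)
```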
[link]
#### Introduction
* The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems (in an unsupervised setting).
* [Link to the paper](https://arxiv.org/abs/1603.08023)
#### Evaluation Metrics Considered
##### Word Based Similarity Metric
###### BLEU
* Analyses the co-occurrences of n-grams in the ground truth and the proposed responses.
* BLEU-N: N-gram precision for the entire dataset.
* Brevity penalty added to avoid bias towards short sentences.
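For reference, the standard BLEU-N definition (not specific to this paper): with modified n-gram precisions $p_n$, uniform weights $w_n = 1/N$, reference length $r$ and candidate length $c$,

$$\text{BLEU-N} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad BP = \min\!\left(1,\, e^{1 - r/c}\right).$$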
###### METEOR
* Creates an explicit alignment between candidate and target response (using WordNet synonyms, stemmed tokens, etc.).
* Compute the harmonic mean of precision and recall between proposed and ground truth.
###### ROUGE
* F-measure based on Longest Common Subsequence (LCS) between candidate and target response.
##### Embedding Based Metric
###### Greedy Matching
* Each token in the actual response is greedily matched with a token in the predicted response based on cosine similarity of word embeddings (and vice-versa).
* The total score is then averaged over all words.
###### Embedding Average
* Calculate sentence level embedding by averaging word level embeddings
* Compare sentence level embeddings between candidate and target sentences.
###### Vector Extrema
* For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
* Idea is that by taking the maxima along each dimension, we can ignore the common words (which will be pulled towards the origin in the vector space).
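A minimal numpy sketch of all three embedding-based metrics, assuming `emb` is a dict from token to pre-trained word vector (all names mine, not the paper's):

```python
import numpy as np

def _cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_one_way(cand, ref, emb):
    # For each reference token, take the best cosine match in the candidate,
    # then average over reference tokens.
    return np.mean([max(_cos(emb[r], emb[c]) for c in cand) for r in ref])

def greedy_matching(cand, ref, emb):
    # Symmetrised: average of both matching directions.
    return 0.5 * (greedy_one_way(cand, ref, emb) + greedy_one_way(ref, cand, emb))

def embedding_average(cand, ref, emb):
    def avg(tokens):
        v = np.mean([emb[t] for t in tokens], axis=0)
        return v / np.linalg.norm(v)
    return avg(cand) @ avg(ref)          # cosine of the two sentence vectors

def vector_extrema(cand, ref, emb):
    def extrema(tokens):
        m = np.stack([emb[t] for t in tokens])       # (n_tokens, dim)
        idx = np.argmax(np.abs(m), axis=0)           # most extreme value per dim
        v = m[idx, np.arange(m.shape[1])]
        return v / np.linalg.norm(v)
    return extrema(cand) @ extrema(ref)
```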
#### Dialogue Models Considered
##### Retrieval Models
###### TF-IDF
* Compute the TF-IDF vectors for each context and response in the corpus.
* C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response with the highest score.
* R-TFIDF computes the cosine similarity between the input context and each response directly.
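A rough sketch of the two retrieval variants with scikit-learn (function names mine; `contexts[i]` and `responses[i]` are paired strings from the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def r_tfidf(input_context, contexts, responses):
    vec = TfidfVectorizer().fit(contexts + responses)
    q = vec.transform([input_context])
    # R-TFIDF: score the input context against each response directly.
    sims = cosine_similarity(q, vec.transform(responses))
    return responses[sims.argmax()]

def c_tfidf(input_context, contexts, responses):
    vec = TfidfVectorizer().fit(contexts + responses)
    q = vec.transform([input_context])
    # C-TFIDF: find the most similar *context*, return its paired response.
    sims = cosine_similarity(q, vec.transform(contexts))
    return responses[sims.argmax()]
```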
###### Dual Encoder
* Two RNNs which respectively compute the vector representation of the input context and response.
* Then calculate the probability that the given response is the ground truth response for that context.
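If I recall the dual encoder of Lowe et al. correctly, that probability is a bilinear match between the context encoding $c$ and the response encoding $r$, $p(\text{match} \mid c, r) = \sigma(c^\top M r)$, with the matrix $M$ learned jointly with the two RNNs.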
##### Generative Models
###### LSTM language model
* LSTM model trained to predict the next word in the (context, response) pair.
* Given a context, model encodes it with the LSTM and generates a response using a greedy beam search procedure.
###### Hierarchical Recurrent Encoder-Decoder (HRED)
* Uses a hierarchy of encoders.
* Each utterance in the context passes through an 'utterance-level' encoder, and the outputs of these encoders are passed through a 'context-level' encoder whose final state conditions the decoder.
* Better handling of long-term dependencies as compared to the conventional Encoder-Decoder.
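A minimal PyTorch sketch of that two-level encoder hierarchy (shapes, layer sizes, and class names are my assumptions):

```python
import torch.nn as nn

class HREDEncoder(nn.Module):
    """Sketch: utterance-level GRU over tokens, context-level GRU over
    the resulting utterance vectors."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.utterance_rnn = nn.GRU(dim, dim, batch_first=True)
        self.context_rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, dialogue):              # (n_utterances, max_len) token ids
        _, h = self.utterance_rnn(self.embed(dialogue))
        utt_vecs = h[-1].unsqueeze(0)          # (1, n_utterances, dim)
        _, ctx = self.context_rnn(utt_vecs)
        return ctx[-1]                         # (1, dim): conditions the decoder
```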
#### Observations
* Human survey to determine the correlation between human judgement on the quality of responses, and the score assigned by each metric.
* Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
* Best performing metric:
* Using word-overlaps - BLEU-2 score
* Using word embeddings - vector average
* Embedding-based metrics would benefit from a weighting of word saliency.
* BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.
[link]
* DCGANs are just a different architecture of GANs.
* In GANs a Generator network (G) generates images. A discriminator network (D) learns to differentiate between real images from the training set and images generated by G.
* DCGANs essentially collapse the Laplacian pyramid technique (many pairs of G and D that progressively upscale an image) into a single pair of G and D.
### How
* Their D: a convolutional network. No linear layers. No pooling layers; strided convolutions instead. LeakyReLUs.
* Their G: starts with a 100d noise vector, which a linear layer projects to 1024x4x4 values. Fractionally strided convolutions (which move by 0.5 per step) then upscale to 512x8x8, and so on up to Cx32x32 or Cx64x64. The last layer is a convolution to 3x32x32/3x64x64 with a Tanh activation.
* The fractionally strided convolutions do essentially the same job as the progressive upscaling in the Laplacian pyramid, so it's basically one Laplacian pyramid inside a single network, with all upscalers trained jointly - leading to higher quality images.
* They use Adam as their optimizer. To reduce instability they lowered the learning rate to 0.0002 (from 0.001) and the momentum/beta1 to 0.5 (from 0.9).

*Architecture of G using fractionally strided convolutions to progressively upscale the image.*
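A minimal PyTorch sketch of such a generator; the exact channel progression and kernel sizes here are my assumptions, not the paper's table:

```python
import torch.nn as nn

def up_block(c_in, c_out):
    # Fractionally strided ("transposed") convolution: doubles spatial size.
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

generator = nn.Sequential(
    # Project the 100-d noise vector (shaped N x 100 x 1 x 1) to 1024x4x4.
    nn.ConvTranspose2d(100, 1024, kernel_size=4, stride=1, padding=0),
    nn.BatchNorm2d(1024),
    nn.ReLU(),
    up_block(1024, 512),   # 8x8
    up_block(512, 256),    # 16x16
    up_block(256, 128),    # 32x32
    nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),  # 3x64x64
    nn.Tanh(),
)
```

With the optimizer settings noted above, training would use something like `torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))`.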
### Results
* High quality images. Still with distortions and errors, but at first glance they look realistic.
* Smooth interpolations between generated images are possible (by interpolating between the noise vectors and feeding these interpolations into G).
* The features extracted by D seem to have some potential for unsupervised learning.
* There seems to be some potential for vector arithmetic (using the initial noise vectors), similar to the vector arithmetic with word vectors. E.g. to generate men with sunglasses via `vector(men) + vector(sunglasses)`.
")
*Generated images, bedrooms.*
")
*Generated images, faces.*
### Rough chapter-wise notes
* Introduction
* For unsupervised learning, they propose to train a GAN and then reuse the weights of D.
* GANs have traditionally been hard to train.
* Approach and model architecture
* They use for D a convnet without linear layers, without pooling layers (only strides), with LeakyReLUs and Batch Normalization.
* They use for G ReLUs (hidden layers) and Tanh (output).
* Details of adversarial training
* They trained on LSUN, Imagenet-1k and a custom dataset of faces.
* Minibatch size was 128.
* LeakyReLU alpha 0.2.
* They used Adam with a learning rate of 0.0002 and momentum of 0.5.
* They note that a higher momentum led to oscillations.
* LSUN
* 3M images of bedrooms.
* They use an autoencoder based technique to filter out 0.25M near duplicate images.
* Faces
* They downloaded 3M images of 10k people.
* They extracted 350k faces with OpenCV.
* Empirical validation of DCGANs capabilities
* Classifying CIFAR-10 using GANs as a feature extractor
* They train a pair of G and D on Imagenet-1k.
* D's top layer has `512*4*4` features.
* They train an SVM on these features to classify the images of CIFAR-10.
* They achieve a score of 82.8%, better than unsupervised K-Means based methods, but worse than Exemplar CNNs.
* Classifying SVHN digits using GANs as a feature extractor
* They reuse the same pipeline (D trained on CIFAR-10, SVM) for the StreetView House Numbers dataset.
* They use 1000 SVHN images (with the features from D) to train the SVM.
* They achieve 22.48% test error.
* Investigating and visualizing the internals of the networks
* Walking in the latent space
* They perform walks in the latent space (i.e. interpolate between input noise vectors and generate several images along the interpolation).
* They argue that this might be a good way to detect overfitting/memorizations as those might lead to very sudden (not smooth) transitions.
* Visualizing the discriminator features
* They use guided backpropagation to visualize what the feature maps in D have learned (i.e. to which images they react).
* They can show that their LSUN-bedroom GAN seems to have learned in an unsupervised way what beds and windows look like.
* Forgetting to draw certain objects
* They manually annotated the locations of objects in some generated bedroom images.
* Based on these annotations they estimated which feature maps were mostly responsible for generating the objects.
* They deactivated these feature maps and regenerated the images.
* That decreased the appearance of these objects. However, it's not as simple as one feature map deactivation leading to one object disappearing: they deactivated quite a lot of feature maps (200), and the objects were often still partially visible or replaced by artefacts/errors.
* Vector arithmetic on face samples
* Wordvectors can be used to perform semantic arithmetic (e.g. `king - man + woman = queen`).
* The unsupervised representations seem to be useable in a similar fashion.
* E.g. they generated images via G, then picked several images that showed men with glasses and averaged those images' noise vectors. They did the same with men without glasses and with women without glasses. Then they computed `men with glasses - men without glasses + women without glasses` on these average vectors to get `women with glasses` (see the sketch below).
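A tiny sketch of that arithmetic, reusing the hypothetical `generator` from the sketch above (`men_with_glasses` etc. would be hand-picked lists of 100-d noise vectors):

```python
import torch

def avg_z(zs):
    # Average the noise vectors of several hand-picked samples.
    return torch.stack(zs).mean(dim=0)

z_new = (avg_z(men_with_glasses)
         - avg_z(men_without_glasses)
         + avg_z(women_without_glasses))
image = generator(z_new.view(1, 100, 1, 1))  # hopefully: a woman with glasses
```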
[link]
This work expands on prior techniques for designing models that can both be stored using fewer parameters and executed using fewer operations and less memory, both of which are key desiderata for making trained machine learning models usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of depthwise and pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, the authors continue to use this basic depthwise infrastructure, but add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically and naively, one could just use lower-dimensional internal representations to match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), so having layers contain only the number of dimensions needed for the manifold would leave too few dimensions after the ReLU's information loss. And yet the network needs non-linearities somewhere in order to learn complex, non-linear functions. So the authors suggest a method that mostly uses smaller-dimensional representations internally while still maintaining ReLUs and the complexity the network needs:

https://i.imgur.com/pN4d9Wi.png

- A lower-dimensional input is "projected up" into a higher-dimensional representation
- A ReLU is applied to this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection links the lower-dimensional representations at the beginning and end of the expansion

This way, the network keeps its non-linearity, but some of its higher-dimensional layers are replaced with lower-dimensional linear ones (a sketch of the block follows below).
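A minimal PyTorch sketch of the block described above; the expansion factor, kernel sizes, and use of ReLU6 are my assumptions about the concrete instantiation:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted-residual linear bottleneck: expand, non-linearity,
    depthwise conv, then a *linear* projection back down."""
    def __init__(self, dim, expansion=6):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            # 1) Project up: 1x1 conv into the higher-dimensional space.
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 2) Depthwise conv: "pull information from a spatial range".
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) Project down with no ReLU, avoiding information loss.
            nn.Conv2d(hidden, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        # 4) Residual connection between the two low-dimensional ends.
        return x + self.block(x)
```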
[link]
I admit it - the title of the paper pulled me in, existing as it does in the chain of weirdly insider-meme papers starting with Vaswani’s 2017 “Attention Is All You Need”. That paper has been hugely influential, and the domain of machine translation as a whole has begun to move away from processing (or encoding) source sentences with recurrent architectures and towards processing them with self-attention architectures. (Self-attention is a little too nuanced to go into in full depth here, but the basic idea is: instead of summarizing varying-length sequences by feeding each timestep into a recurrent loop and building up hidden states, generate a query, and weight the contribution of each timestep to each “hidden state” based on the dot product between that query and each timestep’s representation.) There has been an overall move in recent years away from recurrence being the accepted default for sequence data, and towards attention and (often dilated) convolution taking up more space. I find this an interesting set of developments, and had hopes that this paper would address that arc.

Unfortunately, however, the title is quite out of sync with the actual focus of the paper. Instead of addressing the contribution of attention mechanisms vs recurrence, or even directly addressing any of the particular ideas posed in the “Attention Is All You Need” paper, this paper (“You May Not Need Attention”, YMNNA hereafter) takes aim at a more fundamental structural feature of translation models: the encoder/decoder structure.

The basic idea of an encoder/decoder approach, in a translation paradigm, is that you process the entire source sentence before you start generating the tokens of the predicted, other-language target sentence. Initially, this worked by running an RNN over the full sentence and using the final hidden state of that RNN as a compressed representation of the full sentence. More recently, the norm has been to use multiple layers of RNN, to represent the source sentence via the hidden states at each timestep (so: as many hidden states as you have input tokens), and then, at each step of the decoding process, to calculate an attention-weighted average over all of those hidden states. But fundamentally, both of these structures share the fact that some kind of global representation is calculated and made available to the decoder before it starts predicting words in the output sentence.

This makes sense for a few reasons. First, and most obviously, languages aren’t naturally aligned with one another, in the sense of one word in language X corresponding to one word in language Y; it’s not possible to predict a word in the target sentence if its corresponding source token has not yet been processed. Second, there can be contextual information from the sentence as a whole that disambiguates between different senses of a word, which may have different translations - think Teddy Bear vs Teddy Roosevelt.

However, this paper poses the question: how well can you do if you throw away this structure and build a model that continually emits tokens of the target sequence as it reads in the source sentence? Using a recurrent model, the YMNNA model takes, at each timestep, the new source token, the previous target token, and the prior hidden state from the last timestep of the RNN, and uses these to predict a token (sketched below).
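A schematic of that eager step in PyTorch (entirely my paraphrase of the idea; sizes and names are made up):

```python
import torch
import torch.nn as nn

class EagerDecoderStep(nn.Module):
    """One timestep: consume a source token, condition on the previous
    target token and hidden state, emit a distribution over target tokens."""
    def __init__(self, src_vocab, tgt_vocab, dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.rnn = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tok, prev_tgt_tok, state):
        # Concatenate the newly read source token and the previous target token.
        x = torch.cat([self.src_embed(src_tok),
                       self.tgt_embed(prev_tgt_tok)], dim=-1)
        h, c = self.rnn(x, state)
        # The output vocabulary includes an epsilon/buffer token (see below).
        return self.out(h), (h, c)
```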
However, that problem mentioned earlier - of languages not natively being aligned such that you have the necessary information to predict a word by the time you get to its point in the target sequence - hasn’t gone away, and is still alive and kicking. This paper solves it in a pretty unsatisfying way: by relying on an external tool, fast-align, that does the work of guessing which source tokens correspond to which target tokens, and inserting buffer tokens into the target, so that you don’t need to predict a word until it has already been seen by the source-reading RNN; until then you just predict the buffer (a sketch of this preprocessing follows below). This is fine and clever as a practical heuristic, but it really does make their comparisons against models that do alignment and translation jointly feel a little weak.

https://i.imgur.com/Gitpxi7.png

An additional heuristic that makes the overall narrative of the paper less compelling is the fact that, in order to get comparable performance to their baselines, they padded the target sequences with between 3 and 5 buffer tokens, meaning that the models learned they could process the first 3-5 tokens of the sentence before needing to start emitting the target. Again, there’s nothing necessarily wrong with this, but since they consume a portion of the sentence before emitting translations, it does make for a less stark comparison with the “read the whole sentence” encoder/decoder framework.

A few other frustrations, and notes from the paper’s results section:

- As mentioned earlier, the authors don’t actually compare their work against the “Attention Is All You Need” paper, but instead to a 2014 paper. This is confusing both in terms of using an old baseline for SOTA, and in terms of their title implicitly arguing they are refuting a paper they didn’t compare to.
- Comparing against their old baseline, their eager translation model performs worse on all sentences less than 60 tokens in length (which make up the vast majority of sentences), and only beats the baseline on sentences > 60 tokens in length.
- Additionally, they note as a sort of throwaway line that their model took almost three times as long to train as the baseline, with the same number of parameters, simply because it took so much longer to converge.

Being charitable, there is some argument that an eager translation framework performs well on long sentences, and can do so while only keeping a single hidden state in memory, rather than keeping the hidden states for each source sequence element around, as attention-based decoders require. Overall, however, I found this paper a frustrating let-down that used too many heuristics and hacks to be a compelling comparison to prior work.
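For concreteness, a toy version of the buffer-insertion preprocessing; this is my guess at the mechanics, with fast-align only supplying the `aligned_src_positions`:

```python
EPS = "<eps>"  # buffer token the model predicts while "waiting"

def insert_buffers(target_tokens, aligned_src_positions):
    """Delay each target token until its aligned source token has been read.
    aligned_src_positions[i] is the 0-based source index that target token i
    depends on (e.g. from a fast-align run); the eager model reads one source
    token and emits one target token per step."""
    out = []
    for tok, src_pos in zip(target_tokens, aligned_src_positions):
        while len(out) < src_pos:   # pad until the aligned source word is seen
            out.append(EPS)
        out.append(tok)
    return out

# Reordering example: English target aligned to French "le chat noir",
# with "black" aligned to source position 2 and "cat" to position 1:
# insert_buffers(["the", "black", "cat"], [0, 2, 1])
# -> ['the', '<eps>', 'black', 'cat']
```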