This paper presents a neural network architecture that takes as input a question and a sequence of facts expressed in natural language (i.e. a sequence of words) and produces as output the answer to that question. The main components of the architecture are as follows:
* The question (q) and the facts (f_1, ... , f_K) are each individually transformed into a fixed-size vector using the same GRU RNN (with the last hidden state serving as the vector representation).
* These vectors are then passed through "reasoning layers", where each layer transforms the question q and the facts f_k into new vector representations. This is done by feeding each question-fact pair (q, f_k) to a neural network that outputs a new representation for the fact f_k (which replaces its old representation in the layer), as well as a new representation for the question. All K new question representations are then pooled to obtain a single question representation that replaces the old one in the layer (see the sketch after this list).
* The last reasoning layer is either fed to a softmax layer for binary questions, or to a scoring layer for questions with multiple and varying candidate answers.
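As a concrete (and hypothetical) illustration of one such layer, here is a minimal PyTorch sketch, assuming max-pooling over the updated question representations and the tested variant in which the fact representations are left unchanged; the module and variable names are mine, not the paper's:

```python
import torch
import torch.nn as nn

class ReasoningLayer(nn.Module):
    """One reasoning layer: each (question, fact) pair is fed through a small
    network to produce a candidate updated question representation; the K
    candidates are then pooled into a single new question vector."""
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.interact = nn.Sequential(
            nn.Linear(2 * dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, dim),
            nn.Tanh(),
        )

    def forward(self, q, facts):
        # q: (dim,), facts: (K, dim)
        K = facts.size(0)
        pairs = torch.cat([q.unsqueeze(0).expand(K, -1), facts], dim=1)  # (K, 2*dim)
        q_candidates = self.interact(pairs)                              # (K, dim)
        q_new = q_candidates.max(dim=0).values   # max-pooling assumed here
        return q_new, facts  # facts left unchanged, as in the tested variant
```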
This so-called Neural Reasoner can be trained by backpropagation, in an end-to-end, supervised way. The authors also suggest the use of auxiliary tasks to improve results. The first ("original") adds an autoencoder reconstruction cost that reproduces the question and facts from their first-layer encodings. The second ("abstract") instead reconstructs a more abstract version of the sentences (e.g. "The triangle is above the pink rectangle." becomes "x is above y").
Importantly, while the Neural Reasoner framework is presented in this paper as covering many different variants, the version that is experimentally tested is one where the fact representations f_k are actually left unchanged throughout the reasoning layers, with only the question representation being changed.
The paper presents experiments on two synthetic reasoning tasks and reports performance that compares favorably with previously published alternatives (based on the general Memory Network architecture). The experiments also show that the auxiliary tasks can substantially improve the performance of the model.
#### My two cents
The proposed Neural Reasoner framework is actually very close to work published on arXiv at about the same time on End-to-End Memory Networks \cite{conf/nips/SukhbaatarSWF15}. In fact, the version tested in the paper, with unchanged fact representations throughout layers, is extremely close to End-to-End Memory Networks.
That said, there are also lots of differences. For instance, this paper proposes the use of multilayer networks within each Reasoning Layer, to produce updated question representations. In fact, experiments suggest that using several layers can be very beneficial for the path finding task. The sentence representation at the first layer is also different, being based on a non-linear RNN instead of being based on linear operations on embeddings as in Memory Networks.
The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.
I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the fact representations are also updated during forward propagation, not just the question's representation (which is the version initially described in the paper... I hope experiments on that variant are eventually presented).
It's unfortunate that the authors only performed experiments on 2 of the 20 synthetic question-answering tasks. I hope a future version of this work can report results on the full benchmark and directly compare with End-to-End Memory Networks.
I was also unable to find out which of the question representation pooling mechanisms (section 3.2.2) was used in the experiments. Perhaps the authors forgot to state it?
Overall, a pretty interesting paper that opens different doors towards reasoning with neural networks.
---
This paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell membranes, act by binding to a specific protein within a disease pathway, and thus blocking that protein's mechanism.

The formulation of the docking problem, as best I understand it, is:

1. A "docking program," which is generally some model based on physical and chemical interactions, takes in a (ligand, target protein) pair, searches over a space of ways the ligand could orient itself within the binding pocket of the protein (which way is it facing, where is it twisted, where does it interact with the protein, etc), and ranks them according to plausibility.
2. A scoring function takes in the highest-ranked binding poses (otherwise known as binding modes) and tries to predict the affinity strength of the resulting bond, or the binary outcome of whether a bond is "active".

The goal of this paper was to interpose modern machine learning into the second step, as an alternative scoring function to be applied after pose generation. Given the complex data structure that is a highly-ranked binding pose, the hope was that deep learning would facilitate learning from such a complicated raw data structure, rather than requiring hand-summarized features. They also tested a similar model structure on the problem of predicting whether a highly-ranked binding pose was actually the empirically correct one, as determined by some epsilon ball around the spatial coordinates of the true binding pose. Both of these were binary tasks, which I understand to be:

1. Does this ranked binding pose in this protein have sufficiently high binding affinity to be "active"? This is known as the "virtual screening" task, because it's the relevant task if you want to screen compounds in silico, or virtually, before doing wet lab testing.
2. Is this ranked binding pose the one that would actually be empirically observed? This is known as the "binding mode prediction" task.

The goal of this second task was to better understand biases the researchers suspected existed in the underlying dataset, which I'll explain later in this post.

The researchers used a graph convolution architecture. At a (very) high level, graph convolution works in a way similar to normal convolution, in that it captures hierarchies of local patterns that gradually expand to have visibility over larger areas of the input data. The distinction is that normal convolution defines kernels over a fixed set of nearby spatial coordinates, in a context where direction (the pixel on top vs the pixel on bottom, etc) is meaningful, because photos have meaningful direction and orientation. By contrast, in a graph there is no "up" or "down", and a given node doesn't have a fixed number of neighbors (whereas a fixed pixel in 2D space does), so neighbor-summarization kernels have to be defined in ways that allow you to aggregate information from 1) an arbitrary number of neighbors, in 2) a manner that is agnostic to orientation.
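As a rough illustration of that constraint (a generic sketch of my own, not the specific architecture used in the paper), a graph-convolution layer can aggregate an arbitrary number of neighbors with a permutation-invariant sum:

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """A generic graph-convolution layer (illustrative only, not the paper's model).
    Each node's new feature is a learned function of its own feature plus a
    permutation-invariant aggregation (here a sum) of its neighbors' features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)
        self.lin_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (N, in_dim); adjacency: (N, N) binary (0/1) float matrix
        # Summing over neighbors handles an arbitrary neighbor count and is
        # agnostic to any ordering or "orientation" of the neighbors.
        neigh_sum = adjacency @ node_feats
        return torch.relu(self.lin_self(node_feats) + self.lin_neigh(neigh_sum))
```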
Graph convolutions are useful in this problem because both the summary of the ligand itself, and the summary of the interaction of the posed ligand with the protein, can be expressed in terms of graphs of chemical bonds or interaction sites. Using this as an architectural foundation, the authors test both solo versions and ensembles of networks:

https://i.imgur.com/Oc2LACW.png

1. "L" - A network that uses graph convolution to summarize the ligand itself, with no reference to the protein it's being tested for binding affinity with.
2. "LP" - A network that uses graph convolution on the interaction points between the ligand and protein under the binding pose currently being scored or predicted.
3. "R" - A simple network that takes into account the rank assigned to the binding pose by the original docking program (generally used in combination with one of the above).

The authors came to a few interesting findings by trying different combinations of the above model modules. First, they found evidence supporting an earlier claim that, in the dataset being used for training, there was a bias in the positive and negative samples chosen such that you could predict activity of a ligand/protein binding using *ligand information alone*. This shouldn't be possible if we were sampling in an unbiased way over possible ligand/protein pairs, since even ligands that are quite effective with one protein will fail to bind with another, and it shouldn't be informationally possible to distinguish the two cases without protein information. Furthermore, a random forest on hand-designed features was able to perform comparably to deep learning, suggesting that only simple features are necessary to perform the task on this (biased and thus over-simplified) dataset. Specifically, they found that L+LP models did no better than models of L alone on the virtual screening task.

However, the binding mode prediction task offered an interesting contrast, in that, on this task, it's impossible to predict the output from ligand information alone: by construction, each ligand has some set of binding modes that are not the empirically correct one and one that is, and you can't distinguish between these without looking at the actual protein relationship under consideration. In this case, the LP network did quite well, suggesting that deep learning is able to learn from ligand-protein interactions when it's incentivized to do so. Interestingly, the authors were only able to improve on the baseline model by incorporating the rank output by the original docking program, which you can think of as an ensemble of sorts between the docking program and the machine learning model.

Overall, the authors' takeaways from this paper were that (1) we need to be more careful about constructing datasets, so as to not leak information through biases, (2) graph convolutional models are able to perform well, but (3) they seem to be capturing different things than physics-based models, since ensembling the two together provides marginal value.
---
Layer-wise Relevance Propagation (LRP) is a novel technique that has been used by the authors in multiple use-cases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of deep learning models. Apart from LRP relevance, the authors also discuss quantitative ways to measure the accuracy of the generated heatmaps.

### LRP & Alternatives

What is LRP? LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contribution of each pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by $R^{(l)}_i$ the relevance associated to the $i$th neuron of layer $l$ and by $R^{(l+1)}_j$ the relevance associated to the $j$th neuron in the next layer, the conservation principle requires that

$$\sum_i R^{(l)}_i = \sum_j R^{(l+1)}_j$$

where $R^{(l)}_i$ is given as

$$R^{(l)}_i = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon} R^{(l+1)}_j$$

where $z_{ij}$ is the activation of the $j$th neuron due to input from the $i$th neuron. As per the authors, this is not necessarily the only relevance function which is conserved. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of a higher-layer neuron receive a larger share of the relevance $R_j$ of neuron $j$. A downside of this propagation rule (at least if $\epsilon = 0$) is that the denominator may tend to zero if lower-level contributions to neuron $j$ cancel each other out. The numerical instability can be overcome by setting $\epsilon > 0$; however, in that case the conservation idea is relaxed in order to gain better numerical properties. To conserve relevance exactly, the rule can also be formulated as a sum over positive and negative activations,

$$R^{(l)}_i = \sum_j \left( \alpha \cdot \frac{z^{+}_{ij}}{\sum_{i'} z^{+}_{i'j}} - \beta \cdot \frac{z^{-}_{ij}}{\sum_{i'} z^{-}_{i'j}} \right) R^{(l+1)}_j$$

such that $\alpha - \beta = 1$.
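As a toy illustration of the $\epsilon$-rule above (my own sketch under the stated definitions, not the authors' implementation), relevance for a single dense layer could be redistributed like this:

```python
import numpy as np

def lrp_epsilon_dense(a, W, R_upper, eps=1e-6):
    """Toy epsilon-LRP for one dense layer (illustrative only).
    a: activations of layer l, shape (I,)
    W: weights from layer l to layer l+1, shape (I, J)
    R_upper: relevances of layer l+1, shape (J,)
    Returns relevances of layer l, shape (I,)."""
    z = a[:, None] * W            # z[i, j]: contribution of neuron i to neuron j
    denom = z.sum(axis=0) + eps   # stabilized normalizer per upper-layer neuron j
    R_lower = (z / denom * R_upper).sum(axis=1)
    return R_lower                # sums approximately to R_upper.sum() for small eps
```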
#### Alternatives to LRP for heatmaps

**Sensitivity measurement.** In these methods, the gradient of the output with respect to the input is used to generate the heatmap. This quantity measures how much small changes in a pixel value locally affect the network output.

##### Disadvantages

Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, making the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients only flow through the neuron with maximum intensity in the local neighbourhood. Also, since most of these methods measure the absolute impact on the prediction caused by changes in pixel intensities, the information of whether a pixel intensity spoke in favour of or against the evidence is lost.

**Deconvolutional Networks.** Here the backward discontinuity problem of sensitivity-based methods is absent, so global features can be captured.

##### Disadvantages

However, since the method only takes in the activations from the final layer (which mostly encodes the presence or absence of features), using it to generate heatmaps is likely to yield average maps, lacking image-specific localisation effects. LRP counters these effects nicely because of the way it propagates relevance.

#### Performance of heatmaps

A few concerns that the authors raise are:

- A heatmap is not a segmentation mask; on the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of a bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct).

The authors propose a novel method (MoRF - *Most Relevant First*) for objectively quantifying the quality of a heatmap. A good detailed idea of the measure is best obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in non-relevant areas). The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC).

#### Observation

Most of the sensitivity-based methods answer the question *what change would make the image more or less belong to the category car*, which isn't really the classifier's question. LRP aims to answer the real classifier question: *what speaks for the presence of a car in the image*. An image in the paper gives a good example of how LRP can denoise heatmaps generated on the basis of sensitivity.
---
In the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within self-supervised learning. The basic idea of contrastive learning is that, instead of needing human-generated labels to create a supervised task, you assume that there exists some automated operation you can perform on a data element to generate another data element that, while different, should be considered fundamentally the same, or at least more strongly related, and that you can contrast these related pairs against pairs constructed from the rest of the dataset, with which any given frame would not by default have this assumed relationship of sameness or quasi-similarity.

One fairly central way that "different but still effectively similar" has historically been defined - at least within the realm of image-based models - is through the use of data augmentations: image transformations such as cropping, color jitter, or Gaussian blur, which are applied to an image to create the counterpart in its related pair. Fundamentally, what we're doing when we define these particular augmentations is saying: these transformations don't cause a meaningful change in what the image is, and so we want the representations we get with and without the transformations to be close to one another (or at least to contain enough information to predict one another). Another way to say this is that we're defining properties of the image that we want our representation to be invariant to.

The authors of this paper make the point that, when aggressive cropping is part of your toolkit of augmentations, the crops of the image can actually contain meaningfully different content than the uncropped image. If you remove a few pixels around the edges of an image, it still fundamentally contains the same things. However, if you zoom in dramatically, you may get crops that contain different objects. From an image classification standpoint, you would expect that coding in an invariance to cropping in our representations would, in some cases, also mean coding in an invariance to object type, which would presumably be detrimental to the task of classifying objects.

To explain the extent of the success that aggressive-cropping methods have had so far, they argue that ImageNet has the particular property that its images are curated to primarily and centrally contain a single object at a time, such that, even if you zoom in, you're getting a part of the central object rather than another object entirely. They argue that this dataset bias might explain why object-invariance hasn't been as much of a problem in earlier augmentation-based contrastive work.

To test this, they train different contrastive (MoCo v2) models on the MSCOCO dataset, which consists of complex multi-object scenes and thus no longer has the property of being centrally about one object. They tried one setting where they performed the contrastive loss on the images as a whole, and another where the inputs to the augmentation pipeline were images from the same dataset, but pre-cropped to only contain one object at a time. This was meant, as far as I can tell, to isolate the effect of "object-centric vs not" while holding other dataset factors constant. They then test how well these different models do on an object-centric classification task (Pascal Cropped Boxes).
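Before getting to the results, here is a minimal sketch of the augmentation-based contrastive setup described above - a generic InfoNCE-style loss over augmented pairs, not necessarily the exact MoCo v2 objective used in the paper, with function and variable names of my own:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Generic contrastive (InfoNCE-style) loss sketch.
    z_a, z_b: (N, D) representations of two augmented views of the same N images;
    row i of z_a and row i of z_b form a positive pair, all other rows act as negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```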
They find that the contrastive model that trains on cropped versions of the dataset gets about 3.5 points higher mean accuracy (71.9 vs 75.3) compared to the contrastive loss done on the multi-object versions of images.

They also explicitly try to measure different forms of invariance, through a scheme where they binarize the elements of the representation vector and calculate what proportion of them fire on average with and without a given set of transformations. They find that the main form of invariance that contrastive learning does well at is invariance to occlusion (part of the image not being visible), where both contrastive methods give ~84 percent co-firing, while supervised pre-training only gets about 80.9. However, on other important measures of invariance - viewpoint, illumination direction, and instance (that is, specific instance within a class) - contrastive representations perform notably worse than supervised pretraining.

https://i.imgur.com/7Ghbv5A.png

To try to solve these two problems, they propose a method that learns from video and uses temporally separated frames (which are then augmented) as pairs. They call this Frame Temporal Invariance, and argue, reasonably, that by pushing together the representations of adjacent frames that track a (presumably consistent, or at least slowly-evolving) scene, you should expect better invariance to viewpoint change and image deformation, since those things naturally happen when an object is moving through the world. They also suggest using an off-the-shelf object bounding box model to find particular objects, track them throughout the video, and apply contrastive learning specifically to the bounding boxes that the algorithm thinks track a consistent object.

https://i.imgur.com/2GfCTog.png

Overall, my take on this paper is that the analysis they do - of the different kinds of invariances contrastive vs supervised loss does well on, and of the extent to which contrastive results might be biased by datasets - is quite interesting and a valuable contribution to our understanding of these very currently-hypey algorithms. However, I'm a bit less impressed by the novelty of their proposed solution. Temporal forms of contrastive learning have been around before - in reinforcement learning, and even in the original Contrastive Predictive Coding paper, where the related pairs were related by dint of temporal closeness. So, while using it on video is certainly a good idea, it doesn't feel strongly novel to me. I also feel a little confused by their choice of using an off-the-shelf object detection model as a prerequisite for a self-supervised task, since my impression was that a central goal of self-supervision is building techniques that can scale to situations where it is infeasible to get large amounts of labels, and any method that relies on a pre-existing trained object bounding box model is pretty inherently limited in that regard.
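For reference, here is roughly how I read the binarized "co-firing" invariance measure described above (my interpretation, not the authors' exact definition):

```python
import torch

def cofiring_rate(reps_orig, reps_transformed, threshold=0.0):
    """Fraction of units that fire for the original inputs and still fire after
    the transformation (illustrative sketch only).
    reps_orig, reps_transformed: (N, D) representation matrices."""
    fire_a = reps_orig > threshold          # binarized firing pattern, original view
    fire_b = reps_transformed > threshold   # binarized firing pattern, transformed view
    co_firing = (fire_a & fire_b).float().sum()
    return (co_firing / fire_a.float().sum().clamp(min=1.0)).item()
```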
---
* They describe a method that applies the style of a source image to a target image.
* Example: Let a normal photo look like a van Gogh painting.
* Example: Let a normal car look more like a specific luxury car.
* Their method builds upon the well known artistic style paper and uses a new MRF prior.
* The prior leads to locally more plausible patterns (e.g. fewer artifacts).
### How
* They reuse the content loss from the artistic style paper.
* The content loss was calculated by feeding the source and target image through a network (here: VGG19) and then computing the squared euclidean distance between one or more hidden layer activations.
* They use layer `relu4_2` for the distance measurement.
* They replace the original style loss with an MRF-based style loss (a code sketch follows this list).
* Step 1: Extract `k x k`-sized overlapping patches from the source image.
* Step 2: Perform step (1) analogously for the target image.
* Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`).
* Step 4: Perform step (3) analogously for the target image. (Result: `r_t`)
* Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation).
* Step 6: Calculate the sum of squared euclidean distances between each patch in `r_s` and its best match (according to step 5).
* They add a regularizer loss.
* The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners).
* It is based on the raw pixel values of the last synthesized image.
* For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both.
* They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> x-gradient^2 + y-gradient^2`).
* Their whole optimization problem is then roughly `image = argmin_image MRF-style-loss + alpha1 * content-loss + alpha2 * regularizer-loss`.
* In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization).
* In practice, they sample patches from the style image under several different rotations and scalings.
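A rough PyTorch sketch of the style loss and regularizer described above (a paraphrase of the listed steps with function names of my own, not the authors' implementation; `r_s` and `r_t` are the patch representations from steps 3 and 4):

```python
import torch
import torch.nn.functional as F

def extract_patches(feat, k=3, stride=1):
    """feat: (C, H, W) feature map -> (P, C*k*k) matrix of flattened k x k patches."""
    patches = F.unfold(feat.unsqueeze(0), kernel_size=k, stride=stride)  # (1, C*k*k, P)
    return patches.squeeze(0).t()

def mrf_style_loss(r_s, r_t):
    """MRF-based style loss as in steps 1-6 above: for each source patch, find its
    best-matching target patch via normalized cross correlation (cosine similarity
    of flattened patches), then sum the squared euclidean distances."""
    sim = F.normalize(r_s, dim=1) @ F.normalize(r_t, dim=1).t()  # (P_s, P_t)
    best = sim.argmax(dim=1)                                     # best match per source patch
    return ((r_s - r_t[best]) ** 2).sum()

def regularizer_loss(img):
    """Smoothness term: sum over all pixels of squared x- and y-gradients.
    img: (C, H, W) raw pixels of the current synthesized image."""
    dx = img[:, :, 1:] - img[:, :, :-1]
    dy = img[:, 1:, :] - img[:, :-1, :]
    return (dx ** 2).sum() + (dy ** 2).sum()
```

The full objective would then combine `mrf_style_loss` with the content loss and `regularizer_loss`, weighted by `alpha1` and `alpha2` as in the argmin expression above.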
### Results
* In comparison to the original artistic style paper:
* Fewer artifacts.
* Their method tends to preserve style better, but content worse.
* Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse.

*Non-photorealistic example images. Their method vs. the one from the original artistic style paper.*

*Photorealistic example images. Their method vs. the one from the original artistic style paper.*