Welcome to ShortScience.org! 
[link]
Largescale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) largescale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning  it's costly, and narrowly defined in terms of a certain task  but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties  like effectiveness against a new targetable drug receptor  that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. There are two main conceptual parts to this paper and its method  named GROVER, in truetoMLform tortured acronym style. The first is the actual architecture of their model itself, which combines both a messagepassing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is: https://i.imgur.com/JY4vRdd.png  There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes  In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weighttransforming them, then aggregating again) until each node has a representation that has "absorbed" in information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each nodes end up with three different vectors for these three things.  Multiheaded attention is then performed over these node representations, in the normal way, where all keys and queries are dotproducted together, and put into a softmax to calculate a weighted average over the values  We now have nodelevel representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into a MLP layer and Layer Norm before finally outputting a nodebased node and edge representation. This is then joined by an edgebased node and edge representation from the parallel stack. These are aggregated on a fullgraph level to predict graphlevel properties https://i.imgur.com/NNl6v4Y.png The other component of the GROVER model is the way this architecture is actually trained  without explicit supervised labels. The authors use two tasks  one local, and one global. The local task constructs labels based on local contextual properties of a given atom  for example, the atom here has one doublebonded Nitrogen and one singlebonded Oxygen in its local environment  and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different modifs or functional groups in the molecule, and encodes those into an 85long onehot vector that is being predicted on a graph level. https://i.imgur.com/jzbYchA.png With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are finetuned on small amounts of labeled data. The results are pretty impressive  it achieves new SOTA performance by relatively large amounts on all tasks, even relative to exist semisupervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graphaggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) also performs worse. One thing I wish they'd directly ablated was the valueadd of the local (also referred to as "contextual") and global semisupervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly. 
[link]
This paper deals with the formal question of machine reading. It proposes a novel methodology for automatic dataset building for machine reading model evaluation. To do so, the authors leverage on news resources that are equipped with a summary to generate a large number of questions about articles by replacing the named entities of it. Furthermore a attention enhanced LSTM inspired reading model is proposed and evaluated. The paper is wellwritten and clear, the originality seems to lie on two aspects. First, an original methodology of question answering dataset creation, where contextqueryanswer triples are automatically extracted from news feeds. Such proposition can be considered as important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. the empirical results seem to show relevant improvement with respect to an uptodate list of machine reading models. Given the lack of an appropriate dataset, the author provides a new dataset which scraped CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next the author presents a two novel Deep longshort term memory models which perform well on the Cloze query task. 
[link]
# Overview This paper presents a novel way to align frames in videos of similar actions temporally in a selfsupervised setting. To do so, they leverage the concept of cycleconsistency. They introduce two formulations of cycleconsistency which are differentiable and solvable using standard gradient descent approaches. They name their method Temporal Cycle Consistency (TCC). They introduce a dataset that they use to evaluate their approach and show that their learned embeddings allow for few shot classification of actions from videos. Figure 1 shows the basic idea of what the paper aims to achieve. Given two video sequences, they wish to map the frames that are closest to each other in both sequences. The beauty here is that this "closeness" measure is defined by nearest neighbors in an embedding space, so the network has to figure out for itself what being close to another frame means. The cycleconsistency is what makes the network converge towards meaningful "closeness". ![image](https://userimages.githubusercontent.com/18450628/6388819068b8a500c9ac11e99da7925b72c731c3.png) # Cycle Consistency Intuitively, the concept of cycleconsistency can be thought of like this: suppose you have an application that allows you to take the photo of a user X and increase their apparent age via some transformation F and decrease their age via some transformation G. The process is cycleconsistent if you can age your image, then using the aged image, "deage" it and obtain something close to the original image you took; i.e. F(G(X)) ~= X. In this paper, cycleconsistency is defined in the context of nearestneighbors in the embedding space. Suppose you have two video sequences, U and V. Take a frame embedding from U, u_i, and find its nearest neighbor in V's embedding space, v_j. Now take the frame embedding v_j and find its closest frame embedding in U, u_k, using a nearestneighbors approach. If k=i, the frames are cycle consistent. Otherwise, they are not. The authors seek to maximize cycle consistency across frames. # Differentiable Cycle Consistency The authors present two differentiable methods for cycleconsistency; cycleback classification and cycleback regression. In order to make their cycleconsistency formulation differentiable, the authors use the concept of soft nearest neighbor: ![image](https://userimages.githubusercontent.com/18450628/638910615477a680c9b211e99e4f55e11d81787d.png) ## cycleback classification Once the soft nearest neighbor v~ is computed, the euclidean distance between v~ and all u_k is computed for a total of N frames (assume N frames in U) in a logit vector x_k and softmaxed to a prediction ŷ = softmax(x): ![image](https://userimages.githubusercontent.com/18450628/6389145038c0d000c9b311e989e9d257be3fd175.png). ![image](https://userimages.githubusercontent.com/18450628/63891746e92ed400c9b311e9982c078ebd1d747e.png) Note the clever use of the negative, which will ensure the softmax selects for the highest distance. The ground truth label is a onehot vector of size 1xN where position i is set to 1 and all others are set to 0. Crossentropy is then used to compute the loss. ## cycleback regression The concept is very similar to cycleback classification up to the soft nearest neighbor calculation. However the similarity metric of v~ back to u_k is not computed using euclidean distance but instead soft nearest neighbor again: ![image](https://userimages.githubusercontent.com/18450628/638932319bb46600c9b711e991450c13e8ede5e6.png) The idea is that they want to penalize the network less for "close enough" guesses. This is done by imposing a gaussian prior on beta centered around the actual closest neighbor i. ![image](https://userimages.githubusercontent.com/18450628/6389343929905100c9b811e981fefab238021c6d.png) The following figure summarizes the pipeline: ![image](https://userimages.githubusercontent.com/18450628/638962789f4beb00c9bf11e98be15f1ad67199c7.png) # Datasets All training is done in a selfsupervised fashion. To evaluate their methods, they annotate the Pouring dataset, which they introduce and the Penn Action dataset. To annotate the datasets, they limit labels to specific events and phases between phases ![image](https://userimages.githubusercontent.com/18450628/63894846affa6200c9bb11e989192f2cdf720a88.png) # Model The network consists of two parts: an encoder network and an embedder network. ## Encoder They experiment with two encoders: * ResNet50 pretrained on imagenet. They use conv_4 layer's output which is 14x14x1024 as the encoder scheme. * A Vgglike model from scratch whose encoder output is 14x14x512. ## Embedder They then fuse the k following frame encodings along the time dimension, and feed it to an embedder network which uses 3D Convolutions and 3D pooling to reduce it to a 1x128 feature vector. They find that k=2 works pretty well.
1 Comments

[link]
The paper introduces a sequential variational autoencoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data. #### What is DRAW: The deep recurrent attention writer (DRAW) model has two differences with respect to other variational autoencoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder. #### What do we gain? The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem. #### What follows? A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder. Although this might be less useful since we are already restricting the input of the network. #### Like: * As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way. * The attention model is fully differentiable. #### Dislike: * I think a better exposition of the attention mechanism would improve this paper. 
[link]
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$: $p_X(x) = p_Z(z) \left{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right^{1}$ Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the Jacobian of the function $g$, we can also write: $p_X(x) = p_Z(f(x)) \left{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right$ where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$. So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion), that have the properties of being easily invertible and having an easytocompute determinant of Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allows to construct highly nonlinear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first samples $z\sim p_Z$ and then computing $x=g(z)$. The building blocks for constructing $f$ are the following: **Coupling layers**: This is perhaps the most important piece. It simply computes as its output $b\odot x + (1b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are nonlinear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the nontransformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that. **Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true, even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layer, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then after at least 2 layers, all dimensions have been transformed at least once. **Squeezing operations**: Squeezing operations corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of features maps with spatial resolutions reduced by half (see Figure 3). This allows to expose multiple scales of resolutions to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration. Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channelwise masks. Since the output of each layerschunk must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each layerschunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b. Experiments on CIFAR10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the realvalued nonvolume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN. **My Two Cents** The proposed approach is quite unique and thought provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / onepass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful! I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as realvalued observations. First, it implies that its estimate of bits/dimensions is an upper bound on what it could be if the uniform subpixel noise was integrated out (see Equations 345 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a nonlinear transformation (${\rm logit}(\alpha + (1\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage. I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient descent on the loglikelihood with respect to the value of remaining dimensions. This could be interesting for image completion, or for structured output prediction with realvalued outputs in general. I also wonder how expensive that would be. In all cases, I'm looking forward to saying interesting applications and variations of this model in the future! 