Welcome to ShortScience.org!
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make it easier to run on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks.

Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:

1. Weight rewinding, where you rewind weights to *near* their starting value and then retrain using the learning rates from early in training, outperforms fine-tuning from the values the weights had when you pruned, but also
2. Learning rate rewinding, where you keep the weights as they are but rewind learning rates to what they were early in training, is actually the most effective for a given amount of training time/search cost.

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
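As a rough illustration, here's a minimal PyTorch-style sketch of the difference between the standard fine-tuning baseline and learning rate rewinding after magnitude pruning. The `train_one_epoch` helper, the schedule handling, and the pruning heuristic below are my own stand-ins, not the paper's code.

```python
import torch

# Hypothetical helper: one epoch of training at a fixed learning rate,
# re-applying the pruning masks so pruned weights stay at zero.
def train_one_epoch(model, masks, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # ... run the usual forward / backward / opt.step() loop here ...
    for name, param in model.named_parameters():
        if name in masks:
            param.data.mul_(masks[name])

def magnitude_prune(model, sparsity):
    """Zero out the lowest-magnitude weights and return binary masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                       # skip biases / norm params
            continue
        k = max(1, int(sparsity * param.numel()))
        threshold = param.detach().abs().flatten().kthvalue(k).values
        masks[name] = (param.detach().abs() > threshold).float()
        param.data.mul_(masks[name])
    return masks

def finetune(model, masks, final_lr, epochs):
    # Standard practice: keep training at the final (small) learning rate.
    for _ in range(epochs):
        train_one_epoch(model, masks, lr=final_lr)

def lr_rewind(model, masks, lr_schedule, rewind_point):
    # Learning-rate rewinding: keep the pruned weights as they are,
    # but replay the learning-rate schedule from an earlier point in training.
    for lr in lr_schedule[rewind_point:]:
        train_one_epoch(model, masks, lr=lr)
```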
[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach to image generation produces images that can't be distinguished from the training data.

#### What is DRAW:

The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder.

#### What do we gain?

The resulting images are greatly improved by allowing conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the "Where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
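A very rough sketch of the sequential generation loop, as I understand it, with illustrative sizes and the attention-less read/write for brevity; the paper's spatial attention would replace the full-image read and write used here.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not the paper's exact configuration).
A = B = 28          # image height / width
z_dim, h_dim = 10, 256
T = 10              # number of generation steps

enc_rnn = nn.LSTMCell(2 * A * B + h_dim, h_dim)
dec_rnn = nn.LSTMCell(z_dim, h_dim)
to_mu = nn.Linear(h_dim, z_dim)
to_logvar = nn.Linear(h_dim, z_dim)
write = nn.Linear(h_dim, A * B)   # attention-less "write"; attention would paste a patch

def draw(x):
    n = x.size(0)
    canvas = torch.zeros(n, A * B)
    h_enc = c_enc = torch.zeros(n, h_dim)
    h_dec = c_dec = torch.zeros(n, h_dim)
    kl = 0.0
    for _ in range(T):
        x_hat = x - torch.sigmoid(canvas)                        # what's still missing
        r = torch.cat([x, x_hat], dim=1)                         # "read" (no attention)
        h_enc, c_enc = enc_rnn(torch.cat([r, h_dec], dim=1), (h_enc, c_enc))
        mu, logvar = to_mu(h_enc), to_logvar(h_enc)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        h_dec, c_dec = dec_rnn(z, (h_dec, c_dec))
        canvas = canvas + write(h_dec)                           # additive canvas update
        kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
    return torch.sigmoid(canvas), kl

x = torch.rand(4, A * B)           # a batch of flattened 28x28 "images"
recon, kl = draw(x)
print(recon.shape, kl.shape)       # torch.Size([4, 784]) torch.Size([4])
```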
[link]
#### Introduction

* Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

#### Global Matrix Factorization Methods

* Decompose large matrices into low-rank approximations.
* eg - Latent Semantic Analysis (LSA)

##### Limitations

* Poor performance on the word analogy task.
* Frequent words contribute disproportionately to the similarity measure.

#### Shallow, Local Context-Based Window Methods

* Learn word representations using adjacent words.
* eg - Continuous bag-of-words (CBOW) model and skip-gram model.

##### Limitations

* Since they do not operate directly on the global co-occurrence counts, they cannot utilise the statistics of the corpus effectively.

#### GloVe Model

* To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with other words $k$) instead of the raw probabilities themselves.
* In the most general form:
    * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* We want $F$ to encode the information in the vector space (which has a linear structure), so we can restrict it to the difference of $w_{i}$ and $w_{j}$:
    * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* Since the right-hand side is a scalar and the left-hand side involves vectors, we take the dot product of the arguments:
    * $F((w_{i} - w_{j})^{T}\tilde{w}_{k}) = P_{ik}/P_{jk}$
* $F$ should be invariant to the order of the word pair $i$ and $j$, which leads to:
    * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$
* Doing further simplifications and optimisations (refer to the paper), we get the cost function (a small code sketch of this objective appears at the end of this summary):
    * $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij} \right)^{2}$
* $f$ is a weighting function:
    * $f(x) = \min\left( (x/x_{\max})^{\alpha}, 1 \right)$
    * Typical values are $x_{\max} = 100$ and $\alpha = 3/4$.
* $b$ and $\tilde{b}$ are the bias terms.

##### Complexity

* Depends on the number of non-zero elements in the input matrix.
* Upper-bounded by the square of the vocabulary size.
* Since for shallow window-based approaches complexity depends on $|C|$ (the size of the corpus), tighter bounds are needed.
* By modelling the number of co-occurrences of words as a power-law function of frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$.

#### Evaluation

##### Tasks

* Word Analogies
    * a is to b as c is to ___?
    * Both semantic and syntactic pairs.
    * Find the closest $d$ to $w_{b} - w_{a} + w_{c}$ (using cosine similarity).
* Word Similarity
* Named Entity Recognition

##### Datasets

* Wikipedia dumps - 2010 and 2014
* Gigaword5
* Combination of Gigaword5 and Wikipedia2014
* CommonCrawl
* 400,000 most frequent words considered from the corpus.

##### Hyperparameters

* Size of the context window.
* Whether to distinguish left context from right context.
* Word pairs that are $d$ words apart contribute $1/d$ to the total count.
* $x_{\max} = 100$
* $\alpha = 3/4$
* AdaGrad updates.

##### Models Compared With

* Singular Value Decomposition
* Continuous Bag-Of-Words
* Skip-Gram

##### Results

* GloVe outperforms all other models significantly.
* Diminishing returns for vectors larger than 200 dimensions.
* Small and asymmetric context windows (context window only to the left) work better for syntactic tasks.
* Long and symmetric context windows (context window on both sides) work better for semantic tasks.
* The syntactic task benefited from a larger corpus, though the semantic task performed better with Wikipedia than with Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5.
* Word2vec's performance decreases if the number of negative samples increases beyond about 10.
* For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
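A minimal NumPy sketch of the weighted least-squares objective above, assuming a precomputed co-occurrence matrix `X`; the variable names and the random stand-in data are illustrative, not from the paper's code.

```python
import numpy as np

# Toy sizes and a stand-in co-occurrence matrix (counts).
V, d = 1000, 50
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)

W  = 0.01 * rng.standard_normal((V, d))   # word vectors w_i
Wt = 0.01 * rng.standard_normal((V, d))   # context vectors \tilde{w}_j
b  = np.zeros(V)                          # biases b_i
bt = np.zeros(V)                          # context biases \tilde{b}_j

def weight(x, x_max=100.0, alpha=0.75):
    # f(x) = min((x / x_max)^alpha, 1)
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, Wt, b, bt, X):
    i, j = np.nonzero(X)                  # the sum runs over non-zero counts only
    err = (W[i] * Wt[j]).sum(axis=1) + b[i] + bt[j] - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)

print(glove_loss(W, Wt, b, bt, X))
```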
[link]
Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more acute when it comes to labeling properties of molecules, since they typically require wet lab chemistry to measure empirically. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a new targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere.

There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of the model itself, which combines both a message-passing Graph Neural Network to aggregate local information and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is:

https://i.imgur.com/JY4vRdd.png

- There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and an edge embedding, for four embeddings total. I'll describe the one for nodes; the one for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes.
- In the NodeTransformer version, a message-passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" information from a few hops out into its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, so each node ends up with three different vectors for these three things.
- Multi-headed attention is then performed over these node representations in the normal way, where all keys and queries are dot-producted together and put into a softmax to calculate a weighted average over the values.
- We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put through an MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated at a full-graph level to predict graph-level properties. (A rough code sketch of the node-side stack appears at the end of this summary.)

https://i.imgur.com/NNl6v4Y.png

The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global.
The local task constructs labels based on local contextual properties of a given atom - for example, the atom here has one double-bonded nitrogen and one single-bonded oxygen in its local environment - and tries to predict those labels given the representation of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different motifs or functional groups in the molecule, and encodes those into an 85-long one-hot vector that is predicted at the graph level.

https://i.imgur.com/jzbYchA.png

With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large margins on all tasks, even relative to existing semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before a transformer (the alternative being just running a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly.
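Here is a rough, illustrative sketch of how I picture the node-side GNN + Transformer stack described above; the layer counts, sizes, and the simple mean-neighborhood message passing are my own stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class NodeGTransformer(nn.Module):
    """Sketch: message passing produces query/key/value node states,
    then multi-head self-attention mixes them globally."""
    def __init__(self, d=64, hops=2, heads=4):
        super().__init__()
        self.hops = hops
        self.mp_q = nn.Linear(d, d)   # separate message-passing "branches"
        self.mp_k = nn.Linear(d, d)   # for queries, keys, and values
        self.mp_v = nn.Linear(d, d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm = nn.LayerNorm(d)

    def propagate(self, h, adj, proj):
        # Simple mean-neighborhood aggregation, repeated `hops` times.
        for _ in range(self.hops):
            deg = adj.sum(-1, keepdim=True).clamp(min=1)
            h = torch.relu(proj(adj @ h / deg + h))
        return h

    def forward(self, h, adj):
        # h: (batch, nodes, d) node features; adj: (batch, nodes, nodes)
        q = self.propagate(h, adj, self.mp_q)
        k = self.propagate(h, adj, self.mp_k)
        v = self.propagate(h, adj, self.mp_v)
        out, _ = self.attn(q, k, v)              # global mixing via attention
        return self.norm(out + self.ffn(out))    # MLP + LayerNorm, as in the summary

model = NodeGTransformer()
h = torch.randn(2, 9, 64)                 # 2 molecules, 9 atoms each (toy sizes)
adj = (torch.rand(2, 9, 9) > 0.7).float() # stand-in adjacency matrices
print(model(h, adj).shape)                # torch.Size([2, 9, 64])
```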
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid, and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to 155fps, and the authors claim it achieves double the mAP of other real-time detectors.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN yields a ~3% mAP gain (without adding much computation, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% vs 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
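To make the grid-based output concrete, here's a small sketch of how a single grid cell's prediction can be read out, using the paper's S=7, B=2, C=20 configuration; the decoding below is my own illustrative reading, not the authors' code.

```python
import torch

# YOLO's output for PASCAL VOC: an S x S grid where each cell predicts
# B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)         # stand-in network output for one image

def decode_cell(pred, row, col, img_size=448):
    cell = pred[row, col]
    boxes = cell[:B * 5].view(B, 5)         # (x, y, w, h, conf) per box
    class_probs = cell[B * 5:]              # shared across the cell's boxes
    best = boxes[:, 4].argmax()             # keep the most confident box
    x, y, w, h, conf = boxes[best]
    # x, y are offsets within the cell; w, h are relative to the whole image.
    cx = (col + x) / S * img_size
    cy = (row + y) / S * img_size
    bw, bh = w * img_size, h * img_size
    score, cls = (conf * class_probs).max(dim=0)   # class-specific confidence
    return (cx.item(), cy.item(), bw.item(), bh.item()), cls.item(), score.item()

print(decode_cell(pred, 3, 4))
```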