![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsupervised pretraining is conceptually simple - a variational autoencoder - even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or, GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but have the general sense that GIN architectures are simpler; calculating a weighted sum of current central node features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector. First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing end-to-end supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to high-performing networks being naturally concentrated within the space. https://i.imgur.com/SavZnce.png Looking into their representation as an initialization point, they demonstrate that their initializations do lead to lower long-term regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training. https://i.imgur.com/4DG7lZd.png The authors argue that this is because the representations learned by the supervised methods are "biased towards weight-free operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weight-free architectures would overperform early in the training of each individual network, but my understanding was that the dataset used here is a labeled static dataset of architectures and accuracies, so the within-training-run dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit that comes from using these pretrained representations, even if I don't follow the intuition behind it fully. ![]() |
[link]
#### Introduction * The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks. * [Link to the paper](https://arxiv.org/abs/1408.5882) * [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier) #### Architecture * Pad input sentences so that they are of the same length. * Map words in the padded sentence using word embeddings (which may be either initialized as zero vectors or initialized as word2vec embeddings) to obtain a matrix corresponding to the sentence. * Apply convolution layer with multiple filter widths and feature maps. * Apply max-over-time pooling operation over the feature map. * Concatenate the pooling results from different layers and feed to a fully-connected layer with softmax activation. * Softmax outputs probabilistic distribution over the labels. * Use dropout for regularisation. #### Hyperparameters * RELU activation for convolution layers * Filter window of 3, 4, 5 with 100 feature maps each. * Dropout - 0.5 * Gradient clipping at 3 * Batch size - 50 * Adadelta update rule. #### Variants * CNN-rand * Randomly initialized word vectors. * CNN-static * Uses pre-trained vectors from word2vec and does not update the word vectors. * CNN-non-static * Same as CNN-static but updates word vectors during training. * CNN-multichannel * Uses two set of word vectors (channels). * One set is updated and other is not updated. #### Datasets * Sentiment analysis datasets for Movie Reviews, Customer Reviews etc. * Classification data for questions. * Maximum number of classes for any dataset - 6 #### Strengths * Good results on benchmarks despite being a simple architecture. * Word vectors obtained by non-static channel have more meaningful representation. #### Weakness * Small data with few labels. * Results are not very detailed or exhaustive. ![]() |
[link]
This is an interesting paper that makes a fairly radical claim, and I haven't fully decided whether what they find is an interesting-but-rare corner case, or a more fundamental weakness in the design of neural nets. The claim is: neural nets prefer learning simple features, even if there exist complex features that are equally or more predictive, and even if that means learning a classifier with a smaller margin - where margin means "the distance between the decision boundary and the nearest-by data". A large-margin classifier is preferable in machine learning because the larger the margin, the larger the perturbation that would have to be made - by an adversary, or just by the random nature of the test set - to trigger misclassification. https://i.imgur.com/PJ6QB6h.png This paper defines simplicity and complexity in a few ways. In their simulated datasets, a feature is simpler when the decision boundary along that axis requires fewer piecewise linear segments to separate datapoints. (In the example above, note that having multiple alternating blocks still allows for linear separation, but with a higher piecewise linear requirement). In their datasets that concatenate MNIST and CIFAR images, the MNIST component represents the simple feature. The authors then test which models use which features by training a model with access to all of the features - simple and complex - and then testing examples where one set of features is sampled in alignment with the label, and one set of features is sampled randomly. If the features being sampled randomly are being used by the model, perturbing them like this should decrease the test performance of the model. For the simulated datasets, a fully connected network was used; for the MNIST/CIFAR concatenation, a variety of different image classification convolutional architectures were tried. The paper finds that neural networks will prefer to use the simpler feature to the complete exclusion of more complex features, even if the complex feature is slightly more predictive (can achieve 100 vs 95% separation). The authors go on to argue that what they call this Extreme Simplicity Bias, or Extreme SB, might actually explain some of the observed pathologies in neural nets, like relying on spurious features or being subject to adversarial perturbations. They claim that spurious features - like background color or texture - will tend to be simpler, and that their theory explains networks' reliance on them. Additionally, relying completely or predominantly on single features means that a perturbation along just that feature can substantially hurt performance, as opposed to a network using multiple features, all of which must be perturbed to hurt performance an equivalent amount. As I mentioned earlier, I feel like I'd need more evidence before I was strongly convinced by the claims made in this paper, but they are interestingly provocative. On a broader level, I think a lot of the difficulties in articulating why we expect simpler features to perform well come from an imprecision in thinking in language around the idea - we think of complex features as inherently brittle and high-dimensional, but this paper makes me wonder how well our existing definitions of simplicity actually match those intuitions. ![]() |
[link]
Cover's Universal Portfolio is an information-theoretic portfolio optimization algorithm that utilizes constantly rebalanced porfolios (CRP). A CRP is one in which the distribution of wealth among stocks in the portfolio remains the same from period to period. Universal Portfolio strictly performs rebalancing based on historical pricing, making no assumptions about the underlying distribution of the prices. The wealth achieved by a CRP over n periods is: $S_n(b,x^n) = \displaystyle \prod_{n}^{i=1} b \cdot x$ The key takeaway: Where $\mathrm{b}$ is the allocation vector. Cover takes the integral of the wealth over the entire portfolio to give $b_{t+1}$. This is what makes it "universal". Most implementations in practice do this discretely, by creating a matrix $\mathrm{B}$ with each row containing a combination of the percentage allocatio, and calculating $\mathrm{S} = \mathrm{B\dot x}$. Cover mentions trading costs will eat away most of the gains, especially if this algorithm is allowed to rebalance daily. Nowadays, there are commission-free brokers. See this summary for Universal Portfolios without transaction costs: \cite{conf/colt/BlumK97} ![]()
2 Comments
|
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. ![]() |