[link]
One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive biases - structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having feature maps learn local patterns, which can then be re-used across the whole image. This is based on the prior belief that patterns in images are generally locally contiguous, so feature maps that focus only on small (and gradually larger) local areas are a good fit for that prior.

This paper operates in a similar spirit, except its input data isn't in the form of an image, but a graph: the social graph of multiple agents operating within a multi-agent RL setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), a node within a graph can have an arbitrary number of neighbors, with arbitrary numbers and types of attributes attached to each relationship.

The authors of this paper use graph networks as a sort of auxiliary information-processing system alongside a more typical policy-learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully "capture" a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly.

The graph network used in this model represents both agents and objects in the environment as nodes, with attributes including their position, whether they're available or not (for capturable objects), and their last action. As best I can tell, all agents start out with directed connections going both ways to all other agents and to all objects in the environment, with the only edge attribute being whether the players are on the same team (for competitive environments).

Given this setup, the graph network works through a sort of "diffusion" of information, analogous to a message-passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value, their sender and receiver nodes, and a "global feature". Then, all of the nodes pull in information from their edges and their own past value. Finally, the global attribute gets updated based on summations over the newly updated node and edge information. (If you were predicting graph-level attributes, this global attribute might be where you'd do that prediction; in this case, however, we're just interested in predicting agent-level actions.)

![](https://i.imgur.com/luFlhfJ.png)

All of this has the effect of explicitly modeling agents as entities that both carry information and have connections to other entities.
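To make the update scheme concrete, here is a minimal sketch of one such message-passing round in plain numpy. The dimensions, the sum aggregation, and the stand-in "MLPs" (random tanh layers) are my own assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

# A sketch of one message-passing round over the agent/object graph, following
# the edge -> node -> global update order described above.
rng = np.random.default_rng(0)

def mlp(in_dim, out_dim):
    # Stand-in for a learned update function: a fixed random tanh layer.
    W = rng.normal(size=(in_dim, out_dim)) * 0.1
    return lambda x: np.tanh(x @ W)

N_NODES, N_EDGES = 4, 12                      # e.g. 4 agents, fully connected (directed)
D_NODE = D_EDGE = D_GLOBAL = 8

nodes = rng.normal(size=(N_NODES, D_NODE))    # per-agent/object attributes
edges = rng.normal(size=(N_EDGES, D_EDGE))    # per-relation attributes
glob  = rng.normal(size=(D_GLOBAL,))          # graph-level attribute
senders   = rng.integers(0, N_NODES, size=N_EDGES)
receivers = rng.integers(0, N_NODES, size=N_EDGES)

phi_e = mlp(D_EDGE + 2 * D_NODE + D_GLOBAL, D_EDGE)
phi_v = mlp(D_NODE + D_EDGE, D_NODE)
phi_u = mlp(D_GLOBAL + D_EDGE + D_NODE, D_GLOBAL)

def gn_block(nodes, edges, glob):
    # 1. Edge update: each edge sees its past value, its sender, its receiver, and the global.
    edge_in = np.concatenate(
        [edges, nodes[senders], nodes[receivers], np.tile(glob, (N_EDGES, 1))], axis=1)
    edges = phi_e(edge_in)

    # 2. Node update: each node aggregates (sums) its incoming edges plus its own past value.
    agg = np.zeros((N_NODES, D_EDGE))
    np.add.at(agg, receivers, edges)
    nodes = phi_v(np.concatenate([nodes, agg], axis=1))

    # 3. Global update: summations over the newly updated edges and nodes.
    glob = phi_u(np.concatenate([glob, edges.sum(axis=0), nodes.sum(axis=0)]))
    return nodes, edges, glob

# Stacking a few rounds lets information diffuse across the graph; per-agent
# action predictions would then be read off the final node features.
for _ in range(3):
    nodes, edges, glob = gn_block(nodes, edges, glob)
```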
One benefit the authors claim for this structure is interpretability: when they "play out" the values of their graph network, which they call a Relational Forward Model, or RFM, they observe that the edge values for two agents go up if those agents are about to collaborate on an action, and that the edge values for an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it is easier for humans to understand, and, in the tests the authors ran, it also appears to help agents do better in collaborative games.

![](https://i.imgur.com/BCKSmIb.png)

While I find graph networks quite interesting, and multi-agent learning quite interesting, I'm a little more uncertain about the inherent "graphiness" of this problem, since there aren't really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work on sparser graphs, or in settings where a node is much more strongly connected to its neighbors than to the average other node in the graph. Here, every node is connected to every other node, so the explicit information-localization function of graph networks is less pronounced. I might naively think that, to whatever extent the graph is designed in a way that captures information meaningful to the task, explicit graph methods would have an even greater comparative advantage in that setting.
[link]
The [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

### 1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

* **Representation** for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the *hypothesis space*. If a function is not in the hypothesis space, it cannot be learnt.
* **Evaluation** function tells how good the machine learning model is.
* **Optimization** is the method to search for the best-scoring model.

### 2. It's Generalization That Counts

The fundamental goal of machine learning is to generalize beyond the training set. The data used to evaluate the model must be kept separate from the data used to learn the model. When we use generalization as a goal, we do not have access to a function that we can optimize, so we have to use training error as a proxy for test error.

### 3. Data Alone Is Not Enough

Since our ultimate goal is generalization (see point 2), there is no such thing as **"enough"** data. Some knowledge beyond the data is needed to generalize beyond the data. Another way to put it is: "No learner can beat random guessing over all possible functions." But instead of hard-coding assumptions, learners should allow assumptions to be explicitly stated, varied and incorporated automatically into the model.

### 4. Overfitting Has Many Faces

One way to interpret overfitting is to break down generalization error into two components: bias and variance. **Bias** is the tendency of the learner to constantly learn the same wrong thing (in the image, high bias means more distance from the centre). **Variance** is the tendency to learn random things irrespective of the signal (in the image, high variance means more scattered points).

![Bias Variance Diagram](https://dl.dropboxusercontent.com/u/56860240/A-Paper-A-Week/BiasVarianceDiagram.png)

A more powerful learner (one that can learn many models) need not be better than a less powerful one, because it can have higher variance. While noise is not the only reason for overfitting, it can indeed aggravate the problem. Some tools against overfitting are **cross-validation**, **regularization**, **statistical significance testing**, etc.
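As a small, concrete illustration of the tools mentioned in point 4, here is a sketch of k-fold cross-validation with scikit-learn; the dataset and the choice of a decision tree are arbitrary stand-ins, not anything from the paper.

```python
# Cross-validation as a guard against overfitting: compare an unconstrained
# tree (low bias, high variance) with a depth-limited one on held-out folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (None, 3):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on each held-out fold
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```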
### 5. Intuition Fails In High Dimensions

Generalizing correctly becomes exponentially harder as dimensionality (the number of features) grows. Machine learning algorithms depend on similarity-based reasoning, which breaks down in high dimensions because a fixed-size training set covers only a tiny fraction of the input space. Moreover, our intuitions from three-dimensional space often do not apply in higher-dimensional spaces, so the **curse of dimensionality** may outweigh the benefits of having more features. Fortunately, in most cases, learners benefit from the **blessing of non-uniformity**: data points are concentrated on lower-dimensional manifolds, and learners can implicitly take advantage of this lower effective dimension or use dimensionality reduction techniques.

### 6. Theoretical Guarantees Are Not What They Seem

A common type of bound in machine learning relates to the number of samples needed to ensure good generalization, but such bounds are very loose in nature. Moreover, the bound says that given a large enough training dataset, our learner would return a good hypothesis with high probability or would not find a consistent hypothesis; it does not tell us anything about how to select a good hypothesis space. Another common type of bound is the asymptotic bound, which says "given infinite data, the learner is guaranteed to output the correct classifier". But in practice we never have infinite data, and data alone is not enough (see point 3). So theoretical guarantees should be used to understand and drive algorithm design, not as the sole criterion for selecting an algorithm.

### 7. Feature Engineering Is The Key

Machine learning is an iterative process where we train the learner, analyze the results, modify the learner/data and repeat. Feature engineering is a crucial step in this pipeline. Having the right kind of features (independent features that correlate well with the class) makes learning easier. But feature engineering is also difficult because it requires domain-specific knowledge which extends beyond just the data at hand (see point 3).

### 8. More Data Beats A Clever Algorithm

As a rule of thumb, a dumb algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data means more scalability issues. Fixed-size learners (parametric learners) can take advantage of data only up to a point, beyond which adding more data does not improve the results. Variable-size learners (non-parametric learners) can, in theory, learn any function given a sufficient amount of data, although even non-parametric learners are bound by limitations of memory and computational power.

### 9. Learn Many Models, Not Just One

In the early days of machine learning, the model/learner to be trained was pre-determined and the focus was on tuning it for optimal performance. Then the focus shifted to trying many variants of different learners. Now the focus is on combining the variants of different algorithms to generate the best results. Such model ensembling techniques include *bagging*, *boosting* and *stacking*, as in the sketch below.
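To make point 9 concrete, here is a hedged sketch of one such ensembling technique (bagging) in scikit-learn; again, the dataset and base learner are arbitrary stand-ins.

```python
# Bagging: train many trees on bootstrap resamples and average their votes,
# then compare against a single tree under the same cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(single, n_estimators=50, random_state=0)

for name, model in [("single tree", single), ("bagged trees", bagged)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```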
### 10. Simplicity Does Not Imply Accuracy

Though Occam's razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit. The complexity of a model can be related to the size of its hypothesis space, since smaller spaces allow hypotheses to be represented by smaller, simpler codes. But there is another side to this picture: a learner with a larger hypothesis space that tries fewer hypotheses is less likely to overfit than one that tries more hypotheses from a smaller space. So hypothesis space size is just a rough guide towards accuracy. Domingos concludes in his [other paper](http://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf) that "simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy."

### 11. Representable Does Not Imply Learnable

Just because a function can be represented does not mean that it can actually be learnt. Restrictions imposed by data, time and memory limit the functions that can be learnt in a feasible manner. For example, decision tree learners cannot learn trees with more leaves than the number of training data points. The right question to ask is "whether a function can be learnt", not "whether a function can be represented".

### 12. Correlation Does Not Imply Causation

Correlation may hint at a possible cause-and-effect relationship, but that needs to be investigated and validated; on its face, correlation cannot be taken as proof of causation.
[link]
$\bf Summary:$ The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels) $\times$ (number of filters) $\times$ (size of filter $\times$ size of filter). The paper proposes two strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the filter budget is given, i.e., they do not tinker with the number of filters. Decreasing the number of parameters on its own would lower accuracy, so to compensate the authors propose to downsample late in the network. The results are quite impressive: compared to AlexNet, they achieve a 50x reduction in model size while preserving accuracy. Their model can be further compressed with existing methods like Deep Compression, which are orthogonal to this paper's approach, giving a total reduction of around 510x while still preserving AlexNet-level accuracy.

$\bf Question:$ The impact on running time (especially in the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least be *no worse* than for the baseline models?
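To make the parameter formula and the two strategies concrete, here is a quick back-of-the-envelope sketch; the layer sizes are hypothetical, not taken from the paper.

```python
# Parameter count of a conv layer per the formula above:
# (input channels) x (number of filters) x (filter size x filter size).
# The channel counts below are made up purely for illustration.
def conv_params(in_channels, num_filters, filter_size):
    return in_channels * num_filters * filter_size * filter_size

baseline  = conv_params(64, 64, 3)  # 3x3 filters: 36,864 parameters
strategy1 = conv_params(64, 64, 1)  # (i) 1x1 filters: 4,096 parameters (9x fewer)
strategy2 = conv_params(16, 64, 1)  # (ii) also squeeze input channels to 16: 1,024 parameters
print(baseline, strategy1, strategy2)
```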
[link]
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$, where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately to both terms of the layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$, where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term. The paper shows that this form of dropout is at least as good as, and often superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.

In addition to this observation, I also note the following empirical observations:

1. At test time, averaging the output layers over multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) with fewer layers (e.g. 32) than the original ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper][2].

Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.

[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146
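For concreteness, here is a minimal sketch of the swapout rule $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ in PyTorch; the stand-in $F$, the tensor shapes, and the single fixed keep probability are my own assumptions, not the paper's architecture (which varies the rate per layer).

```python
import torch

def swapout(x, fx, keep_prob=0.5, training=True):
    """Apply independent dropout masks to the skip term x and the residual term fx."""
    if not training:
        # Deterministic inference: replace each mask by its expectation.
        # (The paper finds averaging several stochastic forward passes works better.)
        return keep_prob * x + keep_prob * fx
    theta1 = torch.bernoulli(torch.full_like(x, keep_prob))   # mask on X
    theta2 = torch.bernoulli(torch.full_like(fx, keep_prob))  # mask on F(X)
    return theta1 * x + theta2 * fx

x = torch.randn(8, 16)                      # previous layer X (batch of 8, width 16)
fx = torch.tanh(x @ torch.randn(16, 16))    # stand-in for the residual branch F(X)
y = swapout(x, fx, keep_prob=0.5, training=True)
```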
[link]
This paper is about Convolutional Neural Networks for Computer Vision. It was the first breakthrough in the ImageNet classification challenge (ILSVRC-2010, 1000 classes). ReLU activations were a key aspect that had not been used very often before. The paper also used Dropout in the first two fully-connected layers. ## Training details * Momentum of 0.9 * Learning rate $\varepsilon$ (initialized at 0.01) * Weight decay of $0.0005 \cdot \varepsilon$ * Batch size of 128 * Training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs. ## See also * [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)
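For reference, here is a sketch of what these optimizer settings look like in PyTorch, using torchvision's AlexNet as a stand-in for the original two-GPU model; the random batch is purely to show the wiring, and the framework choice is my own, not the paper's.

```python
import torch
import torch.nn.functional as F
import torchvision

# Hyperparameters from the summary above: momentum 0.9, learning rate 0.01,
# weight decay 0.0005 (scaled by the learning rate inside standard SGD), batch size 128.
model = torchvision.models.alexnet(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

images = torch.randn(128, 3, 224, 224)            # one dummy batch of images
labels = torch.randint(0, 1000, (128,))           # dummy class labels
loss = F.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```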