The [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

### 1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

* **Representation** for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the *hypothesis space*. If a function is not in the hypothesis space, it cannot be learnt.
* **Evaluation** function tells how good the machine learning model is.
* **Optimization** is the method used to search the hypothesis space for the best model according to the evaluation function.

### 2. It's Generalization That Counts

The fundamental goal of machine learning is to generalize beyond the training set. The data used to evaluate the model must therefore be kept separate from the data used to learn the model. When generalization is the goal, we do not have access to the function we actually want to optimize, so we have to use training error as a proxy for test error.

### 3. Data Alone Is Not Enough

Since our ultimate goal is generalization (see point 2), there is no such thing as **"enough"** data. Some knowledge beyond the data is needed to generalize beyond the data. Another way to put it is: "No learner can beat random guessing over all possible functions." But instead of hard-coding assumptions, learners should allow assumptions to be explicitly stated, varied, and incorporated automatically into the model.

### 4. Overfitting Has Many Faces

One way to interpret overfitting is to break generalization error down into two components: bias and variance. **Bias** is the tendency of the learner to consistently learn the same wrong thing (on the paper's dart-throwing illustration, high bias means the darts land far from the centre). **Variance** is the tendency to learn random things irrespective of the signal (high variance means the darts are widely scattered). A more powerful learner (one that can represent many models) need not be better than a less powerful one, because it can have higher variance. While noise is not the only cause of overfitting, it can aggravate the problem. Some tools against overfitting are **cross-validation**, **regularization**, **statistical significance testing**, etc.
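
To make the cross-validation tool from point 4 concrete, here is a minimal sketch using scikit-learn (the dataset and the choice of a decision tree are my own illustrative choices, not from the paper):

```python
# Minimal cross-validation sketch (illustrative; not from the paper).
# Compares apparent training accuracy with cross-validated accuracy to expose overfitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An unconstrained tree can fit the training data (almost) perfectly...
deep_tree = DecisionTreeClassifier(random_state=0)
train_acc = deep_tree.fit(X, y).score(X, y)

# ...but 5-fold cross-validation gives a more honest estimate of generalization.
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.2f}, 5-fold CV accuracy: {cv_acc:.2f}")
```
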
### 5. Intuition Fails In High Dimensions

Generalizing correctly becomes exponentially harder as the dimensionality (number of features) becomes large. Machine learning algorithms depend on similarity-based reasoning, which breaks down in high dimensions because a fixed-size training set covers only a small fraction of the large input space. Moreover, our intuitions from three-dimensional space often do not carry over to higher-dimensional spaces. So the **curse of dimensionality** may outweigh the benefits of having more features. Fortunately, in most cases learners benefit from the **blessing of non-uniformity**: data points are concentrated on lower-dimensional manifolds. Learners can implicitly take advantage of this lower effective dimension, or dimensionality reduction techniques can be used.

### 6. Theoretical Guarantees Are Not What They Seem

A common type of bound in machine learning concerns the number of samples needed to ensure good generalization. But these bounds are very loose. Moreover, such a bound only says that, given a large enough training set, the learner will with high probability either return a good hypothesis or fail to find a consistent one; it does not tell us anything about how to select a good hypothesis space. Another common type of bound is the asymptotic bound, which says "given infinite data, the learner is guaranteed to output the correct classifier". But in practice we never have infinite data, and data alone is not enough (see point 3). So theoretical guarantees should be used to understand and drive algorithm design, not as the sole criterion for selecting an algorithm.

### 7. Feature Engineering Is The Key

Machine learning is an iterative process: we train the learner, analyze the results, modify the learner and/or the data, and repeat. Feature engineering is a crucial step in this pipeline. Having the right kind of features (independent features that correlate well with the class) makes learning easier. But feature engineering is also difficult because it requires domain-specific knowledge which extends beyond just the data at hand (see point 3).

### 8. More Data Beats A Clever Algorithm

As a rule of thumb, a dumb algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data also means more scalability issues. Fixed-size learners (parametric learners) can take advantage of data only up to a point, beyond which adding more data does not improve the results. Variable-size learners (non-parametric learners) can, in theory, learn any function given a sufficient amount of data, though even they are bound by the limitations of memory and computational power.

### 9. Learn Many Models, Not Just One

In the early days of machine learning, the model/learner to be trained was pre-determined and the focus was on tuning it for optimal performance. Then the focus shifted to trying many variants of different learners. Now the focus is on combining the variants of different algorithms to generate the best results. Such model ensembling techniques include *bagging*, *boosting* and *stacking* (see the bagging sketch after point 12).

### 10. Simplicity Does Not Imply Accuracy

Though Occam's razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit. The complexity of a model can be related to the size of its hypothesis space, since smaller spaces allow hypotheses to be described by shorter, simpler codes. But there is another side to this picture: a learner with a larger hypothesis space that tries fewer hypotheses is less likely to overfit than one that tries more hypotheses from a smaller space. So hypothesis space size is only a rough guide to accuracy. Domingos concludes in his [other paper](http://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf) that "simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy."

### 11. Representable Does Not Imply Learnable

Just because a function can be represented does not mean that it can actually be learnt. Restrictions imposed by data, time, and memory limit the functions that can be learnt in practice. For example, decision tree learners cannot learn trees with more leaves than there are training examples. The right question to ask is "can it be learnt?", not just "can it be represented?".

### 12. Correlation Does Not Imply Causation

Correlation may hint at a possible cause-and-effect relationship, but that relationship needs to be investigated and validated; on its own, correlation cannot be taken as proof of causation.
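
As a toy illustration of the ensembling idea from point 9, here is a minimal bagging sketch (my own illustrative example, not code from the paper): several copies of a base learner are trained on bootstrap resamples of the training data and their predictions are combined by majority vote.

```python
# Minimal bagging sketch (illustrative example, not from the paper):
# train base learners on bootstrap resamples and combine them by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    # X_train, y_train, X_test are assumed to be NumPy arrays; labels are 0/1.
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap resample: draw len(X_train) points with replacement.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    # Majority vote across the ensemble (binary labels assumed for brevity).
    return (np.mean(votes, axis=0) > 0.5).astype(int)
```
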
#### Introduction

* Open-domain Question Answering (Open QA) - answering natural-language questions by efficiently querying a large-scale knowledge base (KB).
* Two main approaches:
  * Information Retrieval
    * Transform the question (in natural language) into a valid query (in terms of the KB) to get a broad set of candidate answers.
    * Perform fine-grained detection on the candidate answers.
  * Semantic Parsing
    * Interpret the correct meaning of the question and convert it into an exact query.
    * Limitation: human intervention is needed to create the lexicon, grammar, and schema.
* This work builds upon previous work where an embedding model learns low-dimensional vector representations of words and symbols.
* [Link](https://arxiv.org/abs/1406.3676) to the paper.

#### Task Definition

* Input - a training set of questions (paired with answers).
* A KB providing a structure among the answers.
* Answers are entities in the KB and questions are strings with one identified KB entity.
* The paper uses FREEBASE as the KB.
* Datasets:
  * WebQuestions - built using FREEBASE, the Google Suggest API, and Mechanical Turk.
  * FREEBASE triplets transformed into questions.
  * ClueWeb extractions dataset with entities linked to FREEBASE triplets.
  * Dataset of paraphrased questions from WIKIANSWERS.

#### Embedding Questions and Answers

* The model learns low-dimensional vector embeddings of the words in questions and of the entities and relation types of FREEBASE, such that questions and their answers are represented close to each other in the joint embedding space.
* Scoring function $S(q, a)$, where $q$ is a question and $a$ is an answer, generates a high score if $a$ answers $q$.
  * $S(q, a) = f(q)^{T} g(a)$
* $f(q)$ maps the question to the embedding space.
  * $f(q) = W \phi(q)$
  * $W$ is a matrix of dimension $K \times N$:
    * $K$ - dimension of the embedding space (a hyperparameter).
    * $N$ - total number of words/entities/relation types.
  * $\phi(q)$ - sparse vector encoding the number of times each word appears in $q$.
* Similarly, $g(a) = W \psi(a)$ maps the answer to the embedding space, where $\psi(a)$ gives the answer representation, as discussed below.
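
A minimal sketch of the scoring function described above (illustrative code, not the authors' implementation; the dimensions, the example symbol indices, and the bag-of-symbols construction are assumptions):

```python
# Minimal sketch of S(q, a) = f(q)^T g(a) with a shared embedding matrix W.
# Illustrative only; vocabulary handling and dimensions are assumptions.
import torch

N = 10000   # total number of words + entities + relation types (assumed)
K = 64      # embedding dimension (hyperparameter)
W = torch.randn(K, N, requires_grad=True)  # shared embedding matrix

def bag_of_symbols(indices, n=N):
    """Sparse count vector: how many times each word/entity/relation appears."""
    v = torch.zeros(n)
    for i in indices:
        v[i] += 1.0
    return v

def score(question_idx, answer_idx):
    f_q = W @ bag_of_symbols(question_idx)  # f(q) = W phi(q)
    g_a = W @ bag_of_symbols(answer_idx)    # g(a) = W psi(a)
    return torch.dot(f_q, g_a)              # S(q, a) = f(q)^T g(a)

# Example: score a (question, candidate answer) pair given symbol indices.
s = score(question_idx=[12, 345, 678], answer_idx=[4321, 999])
```

In the paper, $W$ is learned with the margin-based ranking loss described below; here it is simply random for illustration.
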
#### Possible Representations of Candidate Answers

* Answer represented as a **single entity** from FREEBASE; $\psi(a)$ is then a one-of-N encoded vector.
* Answer represented as a **path** from the question entity to the answer entity. The paper considers only one- or two-hop paths, resulting in 3-of-N or 4-of-N encoded vectors (intermediate entities are not recorded).
* **Subgraph representation**, which encodes both the path and the entire subgraph of entities connected to the answer entity. Two embedding representations are used to differentiate between entities in the path and entities in the subgraph.
* The subgraph approach is based on the hypothesis that including more information about the answers should improve results.

#### Training and Loss Function

* A margin-based ranking loss is minimized to learn the matrix $W$.
* Optimization uses stochastic gradient descent, multi-threaded with Hogwild.

#### Multitask Training of Embeddings

* To make use of a large number of synthetically generated questions, the paper also multi-tasks the training of the model with paraphrase prediction.
* Scoring function $S_{prp}(q_1, q_2) = f(q_1)^{T} f(q_2)$, where $f$ uses the same weight matrix $W$ as before.
* A high score is assigned if $q_1$ and $q_2$ belong to the same paraphrase cluster.
* Additionally, the model multi-tasks the mapping of embeddings of FREEBASE entities (mids) to the corresponding words.

#### Inference

* For each question, a candidate set is generated.
* The answer (from the candidate set) with the highest score is reported as the correct answer.
* Candidate set generation strategies:
  * $C_1$ - all KB triplets containing the KB entity of the question form the candidate set. Answers are limited to 1-hop paths.
  * $C_2$ - rank all relation types, keep the top 10, and add only those 2-hop candidates whose paths contain one of the selected relations.

#### Results

* The $C_2$ strategy outperforms the $C_1$ approach, supporting the hypothesis that a richer representation for answers can store more information.
* The proposed approach outperforms the baseline methods, but is itself outperformed by an ensemble of the proposed approach with a semantic parsing via paraphrasing model.

Although machine learning models have been widely adopted as the next step towards simplifying complex problems, their inner workings often remain opaque; making those workings understandable can increase trust in individual predictions and in the model itself.

**Idea:** A good explanation system that can justify a classifier's predictions and help diagnose the reasoning behind a model can greatly raise one's trust in the predictive model.

**Solution:** This paper proposes a local explanation method called LIME, which approximates the model with a linear explanation in the neighbourhood of a data point. The paper outlines desired characteristics for explainers and expounds on how LIME matches them: 1) interpretable, 2) locally faithful, 3) model-agnostic, and 4) able to provide a global perspective. The paper also explores the fidelity-interpretability trade-off: the more complex a model is, the less interpretable a completely faithful explanation would be, so a balance needs to be struck between interpretability and fidelity for complex models.

The paper describes in detail how the proposed LIME explanation model works for different types of predictive classifiers. LIME works by generating random data points around a test data point and fitting a linear explanation to the model's predictions on these perturbed points. Thus, LIME rests on the rather large assumption that every complex model is approximately linear at a microscopic (local) level. This assumption, although strong, seems justified for most models, though it can lead to issues when trying to reason about a complex model globally.
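
A rough sketch of this local-linear-approximation idea for tabular data (illustrative only, not the official `lime` package or the authors' code; the Gaussian perturbation, kernel width, and ridge surrogate are my own choices):

```python
# Rough sketch of LIME's core idea (illustrative, not the authors' implementation):
# perturb around x, weight samples by proximity, and fit a simple linear
# surrogate to the black-box model's outputs; its coefficients are the explanation.
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(predict_fn, x, n_samples=1000, kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # Sample perturbations around the instance of interest.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.shape[0]))
    y = predict_fn(Z)                      # black-box predictions, e.g. P(class=1)
    # Weight perturbed points by their proximity to x.
    dists = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # Fit a weighted linear surrogate; its coefficients explain the local behaviour.
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_

# Usage (hypothetical model): coef = explain_locally(lambda Z: model.predict_proba(Z)[:, 1], x_test)
```
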
In the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them.

All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bonds are sundered and the energy of electrons at different electron shells.

https://i.imgur.com/Mmp8KO6.png

An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network would do so much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism.

The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the molecule-summarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in.

https://i.imgur.com/J7s0q2e.png

Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways.

1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming neighbor (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only order-invariant option).

2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the prior-layer representation at that node, to come up with a next-layer representation at that node. Similarly, the same Update function weights are shared across all atoms.

3. The Readout function, R(), which takes the final-layer representation of each atom node and aggregates the representations into a final graph-level representation in an order-invariant way.
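
To make these three pieces concrete, here is a minimal sketch of one round of message passing under this framework (illustrative PyTorch code; the specific choices of a small linear layer for M, a GRU cell for U, and a summation readout for R are simple examples, not any particular architecture from the paper):

```python
# One round of message passing (illustrative; M, U and R here are simple
# example choices, not a specific architecture from the paper).
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # M(h_v, h_w, e_vw): message from neighbor w to node v, shared across all pairs.
        self.message_fn = nn.Linear(2 * node_dim + edge_dim, node_dim)
        # U(h_v, m_v): combine a node's prior state with its aggregated message.
        self.update_fn = nn.GRUCell(node_dim, node_dim)

    def forward(self, h, edge_index, edge_attr):
        # h: [num_nodes, node_dim]; edge_index: LongTensor [2, num_edges] (w -> v);
        # edge_attr: [num_edges, edge_dim]
        w, v = edge_index
        msgs = torch.relu(self.message_fn(torch.cat([h[v], h[w], edge_attr], dim=-1)))
        # Sum incoming messages at each receiver node (order-invariant aggregation).
        agg = torch.zeros_like(h).index_add_(0, v, msgs)
        return self.update_fn(agg, h)

def readout(h):
    # R: simplest order-invariant readout, a sum over final node states.
    return h.sum(dim=0)
```
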
Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures.

- Does the message function being sent from w to v depend on the feature values at both w and v, or just w? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper) seems like a quite valuable addition (in that all subsequent papers include it).
- Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge.
- Is the Readout function something as simple as a summation of all top-level feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses item-to-item attention within the set of final-layer atom representations to construct an aggregated graph-level embedding.

The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearest-by atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the free-energy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a non-spatial-information network that can perform reasonably well - reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, raises the question of whether it'd ever be possible to learn representations that can match the performance of spatially-informed ones without explicitly providing that information.

This paper presents a method for "learning the learning rate" of a stochastic gradient descent method, in the context of online learning. Indeed, variations in the chosen learning rate or learning rate schedule can have a large impact on the observed performance of stochastic gradient descent. Moreover, in the context of online learning, where we are interested in achieving high performance not only at convergence but at every step of the way, the "choosing the learning rate" problem is even more crucial.

The authors present a method which attempts to train the learning rate itself by gradient descent. This is achieved by "unrolling" the parameter updates of the model across the time steps of online learning, which exposes the interaction between the learning rate and the sum of losses of the model across these time steps. The authors then propose a way to approximate the gradient of the sum of losses with respect to the learning rate, so that it can be used to perform gradient updates on the learning rate itself. The gradient with respect to the learning rate has to be approximated, for essentially the same reason that gradients for training a recurrent neural network online must be approximated (see also my notes on another good paper by Yann Ollivier here: \cite{journals/corr/OllivierC15}). Another approximation is introduced to avoid having to compute a Hessian matrix. Nevertheless, the results suggest that the proposed approximation works well and can improve over a fixed learning rate with a reasonable decay schedule.

#### My two cents

I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be greater in the context of training neural networks online, for which few satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about.

At this point, the experiments consider fairly simple learning scenarios, but I don't see any obstacle to applying the same method to neural networks. One interesting observation from the results is that they are fairly robust to variations of "the learning rate of the learning rate", compared with varying and fixing the learning rate itself.

Finally, I haven't had time to entirely digest one of their theoretical results, suggesting that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. However, that result seems quite interesting and would deserve more attention.
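
To make the general idea of taking gradient steps on the learning rate itself more concrete, here is a minimal sketch that uses a crude one-step approximation of the unrolled gradient. This is only an illustration of the idea, not the authors' algorithm; `grad_fn`, `meta_lr`, and the positivity clamp are my own choices.

```python
# Crude illustration of training the learning rate by gradient descent
# (one-step approximation; NOT the authors' exact algorithm, which uses a
# more careful approximation of the gradient through the unrolled updates).
import numpy as np

def online_sgd_with_adaptive_lr(grad_fn, theta, lr=0.01, meta_lr=1e-4, steps=1000):
    prev_grad = np.zeros_like(theta)
    for t in range(steps):
        g = grad_fn(theta, t)                      # gradient of the loss at step t
        # Since theta_t = theta_{t-1} - lr * g_{t-1}, a one-step approximation gives
        # d(loss_t)/d(lr) ~= -g_t . g_{t-1}; take a gradient step on lr with it.
        lr = lr + meta_lr * np.dot(g, prev_grad)
        lr = max(lr, 1e-8)                         # keep the learning rate positive
        theta = theta - lr * g                     # usual SGD update with current lr
        prev_grad = g
    return theta, lr
```
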