## Introduction
* Introduces techniques to learn word vectors from large text datasets.
* Can be used to find similar words (semantically, syntactically, etc.).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)
## Model Architecture
* Computational complexity is defined in terms of the number of parameters accessed during model training.
* Proportional to $E*T*Q$
* *E* - Number of training epochs
* *T* - Number of words in training set
* *Q* - depends on the model architecture (see the worked example below)
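For intuition, the training cost can be evaluated directly from this formula; the numbers below are hypothetical placeholders, not values taken from the paper.

```python
# Rough estimate of the total training cost, proportional to E * T * Q.
# All values are hypothetical placeholders.
E = 3              # training epochs
T = 1_000_000_000  # words in the training set
Q = 2_500_000      # per-example cost; depends on the architecture (see the sections below)

print(f"total cost ~ {E * T * Q:.2e} parameter accesses")
```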
### Feedforward Neural Net Language Model (NNLM)
* Probabilistic model with input, projection, hidden, and output layers.
* Input layer encodes the *N* previous words using 1-of-V encoding (*V* is the vocabulary size).
* Input layer is projected to a projection layer *P* with dimensionality *N\*D*.
* Hidden layer (of size *H*) feeds an output layer that produces a probability distribution over all *V* words.
* Complexity per training example: $Q = N*D + N*D*H + H*V$
* Can reduce *Q* by using hierarchical softmax with a Huffman binary tree to represent the vocabulary, which cuts the output-layer term from $H*V$ to roughly $H*\log_2 V$ (illustrated below).
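A back-of-the-envelope calculation shows why the output layer matters: with a full softmax, the $H*V$ term dwarfs the rest, while a hierarchical softmax over a binary (Huffman) tree brings it down to roughly $H*\log_2 V$. The dimensions below are illustrative, not the paper's exact settings.

```python
import math

# Illustrative NNLM dimensions (hypothetical, in the spirit of the paper).
N, D, H, V = 10, 500, 500, 1_000_000

q_full = N * D + N * D * H + H * V             # full softmax output layer
q_hier = N * D + N * D * H + H * math.log2(V)  # hierarchical softmax (Huffman tree)

print(f"Q (full softmax):         {q_full:,.0f}")
print(f"Q (hierarchical softmax): {q_hier:,.0f}")  # now dominated by the N*D*H term
```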
### Recurrent Neural Net Language Model (RNNLM)
* Similar to NNLM minus the projection layer.
* Complexity per training example: $Q = H*H + H*V$
* Hierarchical softmax and a Huffman tree can be used here as well; a cost comparison with the NNLM follows below.
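Under the same hypothetical dimensions as the NNLM sketch above, removing the projection layer leaves only the $H*H$ recurrent term once hierarchical softmax shrinks the output cost, so with these illustrative numbers the per-example cost comes out lower than the NNLM's.

```python
import math

# Hypothetical dimensions, matching the NNLM sketch above.
N, D, H, V = 10, 500, 500, 1_000_000
output = H * math.log2(V)            # hierarchical softmax output cost

q_nnlm = N * D + N * D * H + output  # feedforward NNLM
q_rnnlm = H * H + output             # recurrent NNLM (no projection layer)

print(f"NNLM  Q ~ {q_nnlm:,.0f}")
print(f"RNNLM Q ~ {q_rnnlm:,.0f}")
```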
## Log-Linear Models
* Nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
* Learn continuous word vectors using simple models.
* N-gram NNLM trained over the word vectors.
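The first of these two steps can be reproduced with an off-the-shelf implementation; the snippet below is a minimal sketch assuming gensim 4.x and a toy corpus, and it only covers learning the word vectors (training an N-gram NNLM on top of them is left out).

```python
# Step 1: learn continuous word vectors with a simple model (CBOW here).
# Assumes gensim 4.x; the toy corpus is purely illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps", "under", "the", "tree"],
]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0, hs=1)
fox_vector = model.wv["fox"]  # these vectors would feed the downstream NNLM (step 2)
```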
### Continuous Bag-of-Words Model
* Similar to feedforward NNLM.
* No nonlinear hidden layer.
* Projection layer is shared for all words, so the order of words does not influence the projection (hence "bag of words").
* Log-linear classifier uses a window of words to predict the middle word.
* $Q = N*D + D*\log_2 V$
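A minimal sketch of one CBOW training step, using a plain softmax output for readability (the paper's models use hierarchical softmax); all dimensions, ids, and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 5_000, 100                     # illustrative vocabulary size and dimensionality
W_in = rng.normal(0.0, 0.1, (V, D))   # shared projection matrix (the word vectors)
W_out = rng.normal(0.0, 0.1, (D, V))  # output weights

def cbow_step(context_ids, target_id, lr=0.025):
    """One CBOW update: average the context vectors and predict the middle word."""
    h = W_in[context_ids].mean(axis=0)    # shared, order-independent projection
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # plain softmax over the vocabulary
    grad_scores = probs.copy()
    grad_scores[target_id] -= 1.0         # gradient of the cross-entropy loss
    grad_h = W_out @ grad_scores          # backpropagate into the projection
    W_out[:] -= lr * np.outer(h, grad_scores)
    W_in[context_ids] -= lr * grad_h / len(context_ids)
    return -np.log(probs[target_id])      # loss, for monitoring

# Example: four surrounding word ids predict the middle word id.
loss = cbow_step(context_ids=[10, 42, 7, 99], target_id=3)
```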
### Continuous Skip-gram Model
* Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling them less often during training.
* $Q = C*(D + D*\log_2 V)$, where *C* is the maximum distance of a context word from the middle word.
* Given *C* and the training data, a random number *R* is chosen in the range *1* to *C* for each training word.
* *R* words from the history (previous words) and *R* words from the future (next words) are then used as target outputs, and the model is trained to predict them (see the sketch below).
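A small sketch of how training pairs could be generated under this scheme: for each position a random window size *R* ≤ *C* is drawn, which naturally samples distant words less often. Function and variable names are made up for illustration.

```python
import random

def skipgram_pairs(words, C=5, seed=0):
    """Yield (center, context) training pairs with a randomly shrunk window."""
    rng = random.Random(seed)
    for i, center in enumerate(words):
        R = rng.randint(1, C)  # random window size in [1, C]
        for j in range(max(0, i - R), min(len(words), i + R + 1)):
            if j != i:
                yield center, words[j]  # the center word predicts each context word

pairs = list(skipgram_pairs("the quick brown fox jumps over the lazy dog".split()))
```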
## Results
* Skip-gram beats all other models on semantic accuracy tasks (e.g. relating *Athens* to *Greece*).
* Continuous Bag-of-Words outperforms the other models on syntactic accuracy tasks (e.g. relating *great* to *greater*), with skip-gram just behind in performance.
* Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
* Model can learn relationships like "Queen is to King as Woman is to Man", which allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"); the result is closest to Vector("Queen") (see the sketch below).
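A minimal sketch of the analogy operation on trained vectors using cosine similarity; the `vectors` dictionary (word -> numpy array) is assumed to come from a trained model and is hypothetical here.

```python
import numpy as np

def analogy(vectors, a, b, c, topn=1):
    """Return the words whose vectors are closest to vec(a) - vec(b) + vec(c)."""
    query = vectors[a] - vectors[b] + vectors[c]
    query = query / np.linalg.norm(query)
    scores = {
        w: float(v @ query / np.linalg.norm(v))
        for w, v in vectors.items()
        if w not in (a, b, c)  # exclude the input words themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# With vectors from a trained model, this is expected to return ["Queen"]:
# analogy(vectors, "King", "Man", "Woman")
```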