Efficient estimation of word representations in vector space on ShortScience.org


Efficient estimation of word representations in vector space
Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey
arXiv preprint arXiv:1301.3781 - 2013 via Local Bibsonomy
Keywords: thema:deepwalk, language, modelling, skipgram

Summary by Shagun Sodhani 9 years ago

## Introduction

* Introduces techniques to learn word vectors from large text datasets.
* Can be used to find similar words (semantically, syntactically, etc).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)

## Model Architecture

* Computational complexity is defined as the number of parameters that need to be accessed to fully train the model.
* Proportional to $E*T*Q$
    * *E* - Number of training epochs
    * *T* - Number of words in the training set
    * *Q* - Per-example cost, which depends on the model architecture

### Feedforward Neural Net Language Model (NNLM)

* Probabilistic model with input, projection, hidden and output layer.
* Input layer encodes the *N* previous words using 1-of-*V* encoding (*V* is the vocabulary size).
* Input layer is projected to a projection layer *P* with dimensionality *N\*D*.
* Hidden layer (of size *H*) feeds an output layer of dimensionality *V*, which gives the probability distribution over all words.
* Complexity per training example $Q = N*D + N*D*H + H*V$
* The dominant $H*V$ term can be reduced to about $H*\log_2 V$ by using hierarchical softmax with the vocabulary stored as a Huffman binary tree.
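
To make the effect of hierarchical softmax concrete, the formula above can be evaluated directly. This is a minimal sketch: the function name `nnlm_q` and the sizes plugged in are illustrative assumptions, not values taken from the summary.

```python
import math

def nnlm_q(N, D, H, V, hierarchical=False):
    """Per-example cost Q of the feedforward NNLM.

    With hierarchical softmax the output term H*V is assumed to
    shrink to roughly H*log2(V).
    """
    output_cost = H * math.log2(V) if hierarchical else H * V
    return N * D + N * D * H + output_cost

# Illustrative sizes only (assumed for this example).
N, D, H, V = 10, 500, 500, 1_000_000

full = nnlm_q(N, D, H, V)
tree = nnlm_q(N, D, H, V, hierarchical=True)
print(f"full softmax:         Q = {full:,.0f}")
print(f"hierarchical softmax: Q = {tree:,.0f}")
```

With the output term reduced, the $N*D*H$ term becomes the largest contributor, which is the cost the log-linear models below remove.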

### Recurrent Neural Net Language Model (RNNLM)

* Similar to NNLM minus the projection layer.
* Complexity per training example $Q = H*H + H*V$
* Hierarchical softmax and Huffman tree can be used here as well.
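
A Huffman tree helps because frequent words end up near the root, so their output probability needs fewer binary decisions. The sketch below builds Huffman code lengths with the standard library only; the word counts are made up for illustration and `huffman_code_lengths` is a hypothetical helper name.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: code length} for a Huffman tree built from word counts."""
    tie = count()  # tie-breaker so equal counts never compare the word lists
    heap = [(f, next(tie), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {w: 0 for w in freqs}
    while len(heap) > 1:
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        # Every word under the merged node moves one level deeper.
        for w in words1 + words2:
            lengths[w] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), words1 + words2))
    return lengths

# Made-up counts: "the" is frequent, "zebra" is rare.
counts = {"the": 50, "of": 30, "cat": 15, "zebra": 5}
print(huffman_code_lengths(counts))
```

The code length of a word is the number of tree nodes evaluated for it, so the average cost per training word drops below $\log_2 V$ when the word distribution is skewed.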

## Log-Linear Models

* Nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
    * Learn continuous word vectors using simple models.
    * Train an N-gram NNLM on top of those word vectors.

### Continuous Bag-of-Words Model

* Similar to feedforward NNLM.
* No nonlinear hidden layer.
* Projection layer is shared for all words: the context word vectors are averaged, so the order of words does not influence the projection.
* Log-linear classifier uses a window of history and future words to predict the middle word.
* $Q = N*D + D*\log_2 V$
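
The order-invariance of the shared projection is easy to check by hand. The following is a rough sketch, not the reference implementation: the embedding table `EMBEDDINGS` holds made-up 2-dimensional vectors, and `cbow_projection` and `cbow_q` are hypothetical names.

```python
import math

# Hypothetical toy embedding table with D = 2; the values are made up.
EMBEDDINGS = {
    "the": [0.2, 0.4],
    "cat": [0.6, -0.2],
    "on": [-0.4, 0.8],
    "mat": [0.0, 0.2],
}

def cbow_projection(context_words):
    """Average the context word vectors into one D-dimensional projection."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for word in context_words:
        for i, value in enumerate(EMBEDDINGS[word]):
            total[i] += value
    return [t / len(context_words) for t in total]

def cbow_q(N, D, V):
    """Per-example cost Q = N*D + D*log2(V) with hierarchical softmax."""
    return N * D + D * math.log2(V)

forward = cbow_projection(["the", "cat", "on", "mat"])
shuffled = cbow_projection(["mat", "on", "cat", "the"])
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(forward, shuffled))
print("order-invariant:", same)
```

Because the projection is a plain average, there is no $N*D*H$ term: the averaged vector goes straight to the log-linear classifier.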

### Continuous Skip-gram Model

* Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling them less often.
* $Q = C*(D + D*\log_2 V)$ where *C* is the maximum distance of a context word from the middle word.
* Given *C* and the training data, a random *R* is chosen in the range *1 to C* for each training word.
* For that training word, *R* words from the history (previous words) and *R* words from the future (next words) are marked as target outputs and the model is trained on them.
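
The sampling scheme in the last two bullets can be sketched as a generator of (middle word, target word) training pairs. This is a minimal illustration under the assumption that *R* is drawn uniformly and independently for every training word; `skipgram_pairs` is a hypothetical name.

```python
import random

def skipgram_pairs(tokens, C, rng):
    """Build (middle word, target word) pairs with a random window R in 1..C."""
    pairs = []
    for i, center in enumerate(tokens):
        R = rng.randint(1, C)  # inclusive on both ends
        for offset in range(-R, R + 1):
            j = i + offset
            # Skip the middle word itself and positions outside the text.
            if offset != 0 and 0 <= j < len(tokens):
                pairs.append((center, tokens[j]))
    return pairs

rng = random.Random(0)  # seeded only to make the demo repeatable
tokens = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(tokens, C=2, rng=rng):
    print(center, "->", context)
```

Under this assumption a word at distance *d* is used only when *R* is at least *d*, which happens with probability $(C - d + 1)/C$, so nearer words are sampled more often than distant ones.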

## Results

* Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens with Greece).
* Continuous Bag-of-Words outperforms other models on syntactic accuracy tasks (e.g. relating great with greater), with Skip-gram just behind in performance.
* Skip-gram combined with RNNLMs outperforms RNNLMs alone (and other models) on the Microsoft Research Sentence Completion Challenge.
* Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), whose result lies closest to Vector("Queen").
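
The vector arithmetic can be tried on a toy example. The vectors below are hand-made (axes loosely meaning "royalty" and "gender"), not learned embeddings, and `analogy` is a hypothetical helper; the sketch only shows the mechanics of the nearest-neighbour lookup by cosine similarity.

```python
import math

# Hand-made toy vectors (axes: royalty, gender); not learned embeddings.
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.5, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

Excluding the three query words from the candidates is a common convention in this kind of lookup; without it the nearest vector is often one of the inputs.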
