Summary by Ablaikhan Akhazhanov 5 years ago
Stanford’s paper on Global Vectors for Word Representation (GloVe) proposes one of the most popular word embedding methods in NLP. GloVe takes advantage of both global corpus statistics and local context window methods: it constructs a word-context co-occurrence matrix and then learns low-dimensional vectors that preserve as much of the matrix’s statistical structure as possible. The result is a feature space with additive compositionality that retains statistically meaningful word co-occurrence information extracted from the corpus.
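To make the co-occurrence construction concrete, below is a minimal Python sketch of building the word-context counts X from a tokenized corpus with a symmetric context window. The function and variable names are hypothetical; the 1/d distance weighting follows the paper's convention that a pair of words d tokens apart contributes 1/d to the count.

```python
from collections import defaultdict

def build_cooccurrence(corpus, window=10):
    """Build word-context co-occurrence counts X from a tokenized corpus.

    corpus: iterable of token lists.
    Returns (X, vocab), where X maps (i, j) word-index pairs to weighted
    counts and vocab maps tokens to indices. A pair of words d tokens
    apart contributes 1/d, counted symmetrically in both directions.
    """
    vocab = {}
    X = defaultdict(float)
    for sentence in corpus:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in sentence]
        for center, wi in enumerate(ids):
            # look only at the left context; symmetry covers the right side
            start = max(0, center - window)
            for pos in range(start, center):
                wj = ids[pos]
                dist = center - pos
                X[(wi, wj)] += 1.0 / dist
                X[(wj, wi)] += 1.0 / dist
    return X, vocab

# Example: X, vocab = build_cooccurrence([["the", "cat", "sat", "on", "the", "mat"]])
```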
The authors start by building the counts matrix X, where X_ij is the number of times word w_j appears in the context of word w_i. The corresponding co-occurrence probabilities are P_ij = X_ij/X_i, with X_i = Sum_k X_ik. The GloVe model tries to fit a function F(w_i, w_j, w.hat_k) = P_ik/P_jk, the ratio of the probabilities of the context word w.hat_k appearing near w_i and near w_j. This ratio should be large when w.hat_k is relevant to w_i but not to w_j, small in the opposite case, and close to unity when w.hat_k is equally relevant (or irrelevant) to both. For practical reasons, the authors simplify the model to the linear equation w_i.T*w.hat_k + b_i + b.hat_k = log(X_ik), where the bias b_i absorbs log(X_i) and b.hat_k is added for symmetry (an additive shift, log(1+X_ik), is one way to handle zero counts). As the optimization objective, GloVe uses the weighted squared error J = Sum_ij f(X_ij)*(w_i.T*w.hat_j + b_i + b.hat_j - log(X_ij))^2, where the weighting function f downweights noisy rare co-occurrences (X_ij ~ 0) and caps the influence of very frequent ones. The choice of f(x) is somewhat arbitrary and driven by empirical observations; the best performing variant is f(x) = min{1, (x/x_max)^0.75} with x_max = 100. The authors note that the 0.75 exponent echoes the power applied to the unigram distribution in skip-gram's negative sampling, where it plays a similar weighting role. The resulting log-bilinear regression model is optimized with AdaGrad, and the sum of the resulting matrices, W + W.hat, is used as the word embeddings.
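As an illustration of the objective and its optimization, here is a minimal NumPy sketch of the weighting function f, the weighted squared error, and AdaGrad updates over the non-zero entries of X (so f(0) = 0 is implicit). The hyperparameter values and names such as train_glove are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def train_glove(X, vocab_size, dim=50, x_max=100.0, alpha=0.75,
                lr=0.05, epochs=10, seed=0):
    """Fit the GloVe weighted least-squares objective with AdaGrad.

    X: dict {(i, j): count}, e.g. from build_cooccurrence above.
    Returns W + W_hat, the summed word and context vectors.
    """
    rng = np.random.default_rng(seed)
    W = (rng.random((vocab_size, dim)) - 0.5) / dim      # word vectors w_i
    W_hat = (rng.random((vocab_size, dim)) - 0.5) / dim  # context vectors w.hat_j
    b = np.zeros(vocab_size)                             # biases b_i
    b_hat = np.zeros(vocab_size)                         # biases b.hat_j
    # AdaGrad squared-gradient accumulators, initialized to 1
    gW, gW_hat = np.ones_like(W), np.ones_like(W_hat)
    gb, gb_hat = np.ones_like(b), np.ones_like(b_hat)

    pairs = list(X.items())
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            (i, j), x_ij = pairs[idx]
            f = min(1.0, (x_ij / x_max) ** alpha)        # weighting f(X_ij)
            diff = W[i] @ W_hat[j] + b[i] + b_hat[j] - np.log(x_ij)
            # gradients of f(X_ij) * diff^2 (the constant factor 2 is folded into lr)
            grad_wi = f * diff * W_hat[j]
            grad_wj = f * diff * W[i]
            grad_b = f * diff
            # AdaGrad updates: scale by the accumulated squared gradients
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            W_hat[j] -= lr * grad_wj / np.sqrt(gW_hat[j])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            b_hat[j] -= lr * grad_b / np.sqrt(gb_hat[j])
            gW[i] += grad_wi ** 2
            gW_hat[j] += grad_wj ** 2
            gb[i] += grad_b ** 2
            gb_hat[j] += grad_b ** 2
    return W + W_hat  # the paper uses the sum of the two matrices as embeddings
```

Because the model is symmetric in W and W.hat, the two matrices end up playing equivalent roles, and the paper reports a small additional gain from summing them, which is why the sketch returns W + W_hat.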
The authors evaluate GloVe on a variety of NLP tasks and compare it against Skip-Gram, CBOW, SVD, and HPCA. Although GloVe demonstrates advantages in both accuracy and training time (the co-occurrence matrix can be built in parallel), the model is not very different from Word2Vec. Another questionable argument by the GloVe authors is that the weighted squared error on log counts is superior to the cross-entropy loss of the Skip-Gram and ivLBL models because it is more robust to long-tailed distributions. This is not obviously correct, since Skip-Gram uses stochastic optimization, which already mitigates the long-tail issue. In fact, GloVe shows performance similar to Word2Vec on numerous NLP problems, yet the latter has historically gained larger popularity.