#### Introduction
* Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

#### Global Matrix Factorization Methods
* Decompose large matrices into low-rank approximations.
* e.g., Latent Semantic Analysis (LSA)

##### Limitations
* Poor performance on the word analogy task.
* Frequent words contribute disproportionately to the similarity measure.

#### Shallow, Local Context-Based Window Methods
* Learn word representations using adjacent words.
* e.g., the continuous bag-of-words (CBOW) and skip-gram models.

##### Limitations
* Since they do not operate directly on the global co-occurrence counts, they cannot exploit the statistics of the corpus effectively.

#### GloVe Model
* To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with probe words $k$) rather than the raw probabilities themselves.
* In the most general form:
    * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* We want $F$ to encode this information in the vector space (which has a linear structure), so we restrict $F$ to the difference of $w_{i}$ and $w_{j}$:
    * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* Since the right-hand side is a scalar and the arguments are vectors, we take the dot product of the arguments:
    * $F\big((w_{i} - w_{j})^{T}\tilde{w}_{k}\big) = P_{ik}/P_{jk}$
* $F$ should be invariant to exchanging a word with a context word; requiring $F$ to be a homomorphism (and taking $F = \exp$) gives:
    * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$
* With further simplifications and optimisations (refer to the paper), we get the cost function (a short code sketch follows this list):
    * $J = \sum_{i,j=1}^{V} f(X_{ij})\big(w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\big)^{2}$
* $f$ is a weighting function:
    * $f(x) = \min\big((x/x_{\max})^{\alpha}, 1\big)$
    * Typical values: $x_{\max} = 100$ and $\alpha = 3/4$.
* $b$, $\tilde{b}$ are the bias terms.
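The weighting function and the per-pair cost term above translate almost directly into code. Below is a minimal NumPy sketch; the function names (`weight`, `pair_cost`) and the default arguments are illustrative assumptions, not taken from the paper's reference implementation.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x) = min((x / x_max)^alpha, 1)."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def pair_cost(w_i, w_tilde_k, b_i, b_tilde_k, x_ik, x_max=100.0, alpha=0.75):
    """Weighted squared-error term for one non-zero co-occurrence count X_ik."""
    inner = w_i @ w_tilde_k + b_i + b_tilde_k - np.log(x_ik)
    return weight(x_ik, x_max, alpha) * inner ** 2
```

Summing `pair_cost` over all non-zero entries of $X$ gives the objective $J$.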
##### Complexity
* Depends on the number of non-zero elements in the co-occurrence matrix.
* Upper-bounded by the square of the vocabulary size.
* Since the complexity of shallow window-based approaches scales with $|C|$ (the corpus size), a tighter bound is needed for a fair comparison.
* By modelling the number of co-occurrences of a word pair as a power-law function of its frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$.

#### Evaluation

##### Tasks
* Word analogies
    * "a is to b as c is to ___?"
    * Both semantic and syntactic pairs.
    * Find the word $d$ whose vector is closest to $w_{b} - w_{a} + w_{c}$ by cosine similarity (a code sketch of this appears at the end of this summary).
* Word similarity
* Named Entity Recognition (NER)

##### Datasets
* Wikipedia dumps - 2010 and 2014
* Gigaword5
* Combination of Gigaword5 and Wikipedia 2014
* CommonCrawl
* The 400,000 most frequent words of each corpus are kept.

##### Hyperparameters
* Size of the context window.
* Whether to distinguish left context from right context.
* Decreasing weighting for distant pairs - word pairs that are $d$ words apart contribute $1/d$ to the total count.
* $x_{\max} = 100$
* $\alpha = 3/4$
* AdaGrad updates.

##### Models Compared With
* Singular Value Decomposition (SVD)
* Continuous Bag-Of-Words (CBOW)
* Skip-Gram

##### Results
* GloVe outperforms all the other models significantly.
* Diminishing returns for vectors larger than 200 dimensions.
* Small, asymmetric context windows (context to the left only) work better for syntactic tasks.
* Large, symmetric context windows (context on both sides) work better for semantic tasks.
* The syntactic task benefits from a larger corpus, while the semantic task performs better with Wikipedia than with Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5.
* Word2vec's performance decreases if the number of negative samples increases beyond about 10.
* For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
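As a concrete illustration of the analogy evaluation mentioned above, here is a small sketch, assuming `vectors` is a `(V, d)` NumPy array of word embeddings and `vocab` maps each word to its row index; the function name and the exclusion of the three query words are my own choices.

```python
import numpy as np

def solve_analogy(a, b, c, vectors, vocab):
    """Answer "a is to b as c is to ?" by cosine similarity to w_b - w_a + w_c."""
    query = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    # Cosine similarity of the query against every word vector.
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-8
    )
    # Exclude the three query words themselves before taking the argmax.
    for word in (a, b, c):
        sims[vocab[word]] = -np.inf
    best = int(np.argmax(sims))
    return next(word for word, idx in vocab.items() if idx == best)
```

For example, with good embeddings `solve_analogy("man", "king", "woman", vectors, vocab)` should return `"queen"`.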
---
Stanford's paper on Global Vectors for Word Representation proposes one of the most popular word embedding methods in NLP. GloVe takes advantage of both global corpus statistics and local context window methods by constructing the word-context co-occurrence matrix and factorising it into a low-dimensional vector space. It builds a feature space with additive compositionality while preserving statistically meaningful word co-occurrence information extracted from the corpus.

The authors start by building the counts matrix $X$, where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$; the corresponding probabilities are $P_{ij} = X_{ij}/X_{i}$ with $X_{i} = \sum_{k} X_{ik}$. The GloVe model tries to fit a function $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$, the ratio of the probabilities of word $\tilde{w}_{k}$ appearing in the context of $w_{i}$ and of $w_{j}$ respectively. The model is expected to produce a large output when $\tilde{w}_{k}$ is relevant to $w_{i}$ but not to $w_{j}$, and vice versa; when $\tilde{w}_{k}$ is equally relevant or irrelevant to both, the output is close to unity. For practical reasons, the authors simplify the model to the linear equation $w_{i}^{T}\tilde{w}_{k} + b_{i} + \tilde{b}_{k} = \log X_{ik}$, where the bias $b_{i}$ absorbs $\log X_{i}$ and $\tilde{b}_{k}$ is added to keep the model symmetric in words and context words (an additive shift, $\log(1 + X_{ik})$, can be used to avoid the logarithm of zero counts). As the optimization objective, GloVe uses the weighted squared error $J = \sum_{i,j} f(X_{ij})\big(w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\big)^{2}$, where the weighting function mitigates the effect of noisy rare co-occurrences ($X_{ij} \approx 0$). The choice of $f(x)$ is somewhat arbitrary and driven by empirical observations; the best performing one is $f(x) = \min\{1, (x/x_{\max})^{0.75}\}$ with $x_{\max} = 100$. The authors note that the same $0.75$ power appears in the negative-sampling distribution of the skip-gram model, where it also acts as a weighting of word frequencies. The resulting log-bilinear regression model is then optimized with AdaGrad, and the sum of the resulting matrices $W + \tilde{W}$ is used as the word embeddings.

The authors evaluate GloVe on a variety of NLP tasks and compare it against Skip-Gram, CBOW, SVD and HPCA. Although GloVe demonstrates advantageous metrics in both accuracy and training time (the co-occurrence matrix is built in parallel), the model is not very different from Word2Vec. Another questionable argument of the GloVe authors is that the squared error on log counts is superior to the cross-entropy loss of the Skip-Gram and ivLBL models because of its higher robustness to long-tailed distributions. This is not obviously correct, since Skip-Gram uses stochastic optimization, which itself mitigates the long-tail vulnerability. In fact, GloVe shows performance similar to Word2Vec on numerous NLP problems, yet the latter has historically gained larger popularity.
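To make the training procedure described here concrete, below is a rough NumPy sketch of the weighted least-squares fit with per-parameter AdaGrad steps, returning $W + \tilde{W}$ as the final embeddings. The function name, the dict-based co-occurrence input and the hyperparameter defaults are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def train_glove(cooc, V, dim=50, epochs=25, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Sketch of GloVe training.

    `cooc` maps index pairs (i, k) to co-occurrence counts X_ik > 0;
    `V` is the vocabulary size. Returns W + W_tilde as the word embeddings.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(V, dim))
    Wt = rng.normal(scale=0.1, size=(V, dim))
    b = np.zeros(V)
    bt = np.zeros(V)
    # AdaGrad accumulators, initialised to 1 to avoid division by zero.
    gW, gWt = np.ones((V, dim)), np.ones((V, dim))
    gb, gbt = np.ones(V), np.ones(V)

    for _ in range(epochs):
        for (i, k), x in cooc.items():
            f = min((x / x_max) ** alpha, 1.0)
            diff = W[i] @ Wt[k] + b[i] + bt[k] - np.log(x)
            # Gradients of f(X_ik) * diff^2 with respect to each parameter block.
            grad_wi = 2.0 * f * diff * Wt[k]
            grad_wk = 2.0 * f * diff * W[i]
            grad_b = 2.0 * f * diff
            # AdaGrad: scale each step by the root of the accumulated squared gradients.
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            Wt[k] -= lr * grad_wk / np.sqrt(gWt[k])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            bt[k] -= lr * grad_b / np.sqrt(gbt[k])
            gW[i] += grad_wi ** 2
            gWt[k] += grad_wk ** 2
            gb[i] += grad_b ** 2
            gbt[k] += grad_b ** 2

    # Combining the two sets of vectors follows the paper, which reports a small boost.
    return W + Wt
```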