[link]
#### Introduction * Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods. #### Global Matrix Factorization Methods * Decompose large matrices into low-rank approximations. * eg - Latent Semantic Analysis (LSA) ##### Limitations * Poor performance on word analogy task * Frequent words contribute disproportionately high to the similarity measure. #### Shallow, Local Context-Based Window Methods * Learn word representations using adjacent words. * eg - Continous bag-of-words (CBOW) model and skip-gram model. ##### Limitations * Since they do not operate directly on the global co-occurrence counts, they can not utilise the statistics of the corpus effectively. #### GloVe Model * To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurene probabilites (with other words $k$) instead of using raw probabilites themselves. * In most general form: * $F(w_{i}, w_{j}, w_{k}^{~} ) = P_{ik}/P_{jk}$ * We want $F$ to encode information in the vector space (which have a linear structure), so we can restrict to the difference of $w_{i}$ and $w_{j}$ * $F(w_{i} - w_{j}, w_{k}^{~} ) = P_{ik}/P_{jk}$ * Since right hand side is a scalar and left hand side is a vector, we take dot product of the arguments. * $F( (w_{i} - w_{j})^{T}, w_{k}^{~} ) = P_{ik}/P_{jk}$ * *F* should be invariant to order of the word pair $i$ and $j$. * $F(w_{i}^{T}w_{k}^{~}) = P_{ik}$ * Doing further simplifications and optimisations (refer paper), we get cost function, * $J = \sum_{\text{over all i, j pairs in the vocabulary}}[w_{i}^{T}w_{k}^{~} + b_{i} + b_{k}^{~} - log(X_{ik})]^{2}$ * $f$ is a weighing function. * $f(x) = min((x/x_{max})^{\alpha}, 1)$ * Typical values, $x_{\max} = 100$ and $\alpha = 3/4$ * *b* are the bias terms. ##### Complexity * Depends on a number of non-zero elements in the input matrix. * Upper bound by the square of vocabulary size * Since for shallow window-based approaches, complexity depends on $|C|$ (size of the corpus), tighter bounds are needed. * By modelling number of co-occurrences of words as power law function of frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$ #### Evaluation ##### Tasks * Word Analogies * a is to b as c is to ___? * Both semantic and syntactic pairs * Find closest d to $w_{b} - w_{c} + w_{a}$ (using cosine similarity) * Word Similarity * Named Entity Recognition ##### Datasets * Wikipedia Dumps - 2010 and 2014 * Gigaword5 * Combination of Gigaword5 and Wikipedia2014 * CommonCrawl * 400,000 most frequent words considered from the corpus. ##### Hyperparameters * Size of context window. * Whether to distinguish left context from right context. * $f$ - Word pairs that are $d$ words apart contribute $1/d$ to the total count. * $xmax = 100$ * $\alpha = 3/4$ * AdaGrad update ##### Models Compared With * Singular Value Decomposition * Continous Bag-Of-Words * Skip-Gram ##### Results * Glove outperforms all other models significantly. * Diminishing returns for vectors larger than 200 dimensions. * Small and asymmetric context windows (context window only to the left) works better for syntactic tasks. * Long and symmetric context windows (context window to both the sides) works better for semantic tasks. * Syntactic task benefited from larger corpus though semantic task performed better with Wikipedia instead of Gigaword5 probably due to the comprehensiveness of Wikipedia and slightly outdated nature of Gigaword5. * Word2vec’s performance decreases if the number of negative samples increases beyond about 10. * For the same corpus, vocabulary, and window size GloVe consistently achieves better results, faster.
Your comment:
|