[link]
Stanford’s paper on Global Vectors for Word Representation (GloVe) proposes one of the most popular word embedding methods in NLP. GloVe takes advantage of both global corpus statistics and local context window methods: it constructs the word-context co-occurrence matrix and reduces its dimensionality while preserving as much of the variance as possible. The result is a feature space with additive compositionality that retains statistically meaningful word occurrence information extracted from the corpus.

The authors start by building the count matrix X, where X_ij is the number of times word w_j appears in the context of word w_i. The corresponding probabilities are P_ij = X_ij / X_i. The GloVe model fits a function F(w_i, w_j, w.hat_k) = P_ik / P_jk, the ratio of the probabilities of a probe word w.hat_k appearing in the contexts of w_i and w_j respectively. The model is expected to produce a large output when w.hat_k is relevant to w_i but not to w_j, a small output in the opposite case, and an output close to unity when w.hat_k is equally relevant or irrelevant to both. For practical reasons, the authors simplify the model to the linear equation w_i.T * w.hat_k + b_i + b.hat_k = log(1 + X_ik), where the bias b_i absorbs log(X_i) and b.hat_k is added to keep the equation symmetric in the context word.

As the optimization objective, GloVe uses the weighted squared error J = Sum_ij f(X_ij) * (w_i.T * w.hat_j + b_i + b.hat_j - log(1 + X_ij))^2, where the weighting function f mitigates the effect of noisy, rare co-occurrences (X_ij ~ 0). The choice of f(x) is somewhat arbitrary and driven by empirical observations; the best performing variant is f(x) = min{1, (x / x_max)^0.75} with x_max = 100. The authors note that the same 0.75 power appears in the noise distribution used for negative sampling in the skip-gram model, where it likewise acts as a frequency weighting. The resulting log-bilinear regression model is optimized with AdaGrad, and the sum of the resulting matrices, W + W.hat, is used as the word embeddings.

The authors evaluate GloVe on a variety of NLP tasks and compare it against skip-gram, CBOW, SVD, and HPCA. Although GloVe demonstrates advantageous metrics in both accuracy and training time (the co-occurrence matrix is built in parallel), the model is not very different from Word2Vec. Another questionable argument of the GloVe authors is that the squared error on log counts is superior to the cross-entropy loss of the skip-gram and ivLBL models because it is more robust to long-tailed distributions. This is not obviously correct, since skip-gram uses stochastic optimization, which inherently mitigates the long-tail vulnerability. In fact, GloVe shows performance similar to Word2Vec on numerous NLP problems, yet the latter has historically gained larger popularity.
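To make the objective concrete, here is a minimal NumPy sketch of the weighting function and the weighted least-squares loss as summarized above. The function names are mine, and the log(1 + X) shift follows the formulation used in this summary (the paper's cost function works with log(X_ij) and relies on f(0) = 0):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) = min{1, (x / x_max)^alpha}."""
    return np.minimum(1.0, (np.asarray(x, dtype=float) / x_max) ** alpha)

def glove_loss(W, W_hat, b, b_hat, X):
    """Weighted squared error J over the nonzero co-occurrence counts X_ij.

    W, W_hat : (V, d) word and context embedding matrices
    b, b_hat : (V,)   word and context biases
    X        : (V, V) dense co-occurrence count matrix
    """
    rows, cols = np.nonzero(X)
    counts = X[rows, cols]
    # w_i . w_hat_j + b_i + b_hat_j should approximate log(1 + X_ij)
    preds = np.sum(W[rows] * W_hat[cols], axis=1) + b[rows] + b_hat[cols]
    errors = preds - np.log1p(counts)
    return np.sum(glove_weight(counts) * errors ** 2)
```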
[link]
A fundamental paper by Adam Berger and his colleagues at the IBM Watson Research Center presents a thorough guide on how to apply maximum entropy (ME) modelling in NLP. The authors follow the principle of maximum entropy, show the optimality of their procedure, and establish its duality with maximum log-likelihood estimation. Importantly, the paper proposes a method for automatic feature selection, perhaps the most critical step in the entire approach. Empirical results from Candide, IBM’s automatic machine-translation system, demonstrate the capabilities of ME models. Both the originality and the wide range of possible applications made the paper stand out and led to well-deserved accolades.

Many practical NLP problems impose constraints in the form of data or explicit statements while requiring minimal assumptions about the underlying distributions. Laplace advocated that in this scenario the best strategy is to treat all outcomes as equally likely as possible while remaining consistent with the observed facts. The authors follow Laplace’s philosophy and use entropy as the measure of this uniformity. The principle of maximum entropy then states that the optimal model is the one with the largest entropy among all models satisfying the constraints. For a feature f, the empirical expectation with respect to the training distribution p_tilde(x,y) is p_tilde(f) = sum[p_tilde(x,y) * f(x,y)], while the model expectation is p(f) = sum[p_tilde(x) * p(y|x) * f(x,y)]; each constraint requires p(f) = p_tilde(f). The objective is the conditional entropy H(p) = -sum[p_tilde(x) * p(y|x) * log p(y|x)], and the optimal distribution is p* = argmax H(p) subject to the constraints.

This constrained optimization usually requires a numerical solution, so the authors introduce the Lagrangian L(p, lambda) = H(p) + sum[lambda_i * (p(f_i) - p_tilde(f_i))], which yields p_lambda(y|x) = 1/Z(x) * exp(sum[lambda_i * f_i(x,y)]) and the dual objective L(lambda) = -sum[p_tilde(x) * log Z(x)] + sum[lambda_i * p_tilde(f_i)], where Z(x) is a normalizing constant. The unconstrained dual problem becomes lambda* = argmax L(lambda). Interestingly, L(lambda) is exactly the log-likelihood of the data sample, which makes ME and MLE dual problems and ensures the optimality of ME, since it converges to the best fit of the data. To estimate the model parameters, the authors propose an iterative scaling method, which updates each lambda_i by solving sum[p_tilde(x) * p(y|x) * f_i(x,y) * exp(delta_lambda_i * f#(x,y))] = p_tilde(f_i), where f#(x,y) denotes the sum of all feature values for the pair (x,y).

The choice of features f is arguably the most critical step in the procedure. The authors propose a simple feature-selection scheme: iteratively add the candidate feature f_new that yields the largest gain in the training-data log-likelihood, and keep expanding the feature set until the cross-validation results converge. To evaluate the ME approach, they apply it to context-sensitive word translation, sentence segmentation, and word reordering. These applications demonstrate the efficacy of ME techniques for context-sensitive modelling, and the paper undoubtedly deserves the attention it gained; in word reordering, for example, the model outperforms a simple baseline by 10%. However, the paper would benefit from a deeper analysis of feature selection. The proposed method is computationally expensive, as it repeatedly calls an optimization subroutine and involves a cross-validation step. Moreover, the authors skip the discussion of overfitting and of how the initial pool of candidate features is chosen. They imply that the features have to be designed manually, which makes the approach impractical for large, diverse problems. Further developments in dimensionality reduction and feature extraction could therefore benefit ME.
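For illustration, here is a minimal sketch of the exponential-form conditional model p_lambda(y|x) derived above. The function name, the representation of features as Python callables, and the toy translation feature are my own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def maxent_conditional(lambdas, features, x, labels):
    """p_lambda(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x).

    lambdas  : sequence of weights, one per feature
    features : list of callables f_i(x, y) -> float (typically binary indicators)
    labels   : candidate outputs y for the context x
    """
    scores = np.array([
        sum(lam * f(x, y) for lam, f in zip(lambdas, features))
        for y in labels
    ])
    scores -= scores.max()          # shift for numerical stability
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()    # division by Z(x) normalizes over labels

# Toy usage: a single indicator feature that fires when the English word "in"
# is rendered as the French word "dans" (a hypothetical Candide-style feature).
f0 = lambda x, y: 1.0 if (x == "in" and y == "dans") else 0.0
print(maxent_conditional([1.5], [f0], "in", ["dans", "en", "pendant"]))
```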
[link]
The paper, written by an international collaboration between UPenn and CUNI, presents an original parsing approach that expands the conventional projective tree-based method to cover non-projective dependencies in text. The authors represent words and their relationships as vertices and edges of a complete directed graph, respectively, and then employ the Chu-Liu-Edmonds algorithm to find maximum spanning trees, which allows non-projective parses. Importantly, they demonstrate better accuracy for Czech and higher efficiency (O(n^2)) compared to Eisner’s projective parsing (O(n^3)). The generality and unexpectedly low asymptotic complexity of the proposed approach attracted numerous researchers and earned the paper the best student paper award at the HLT/EMNLP 2005 conference.

In this framework, finding the maximum dependency tree corresponds to finding a maximum spanning tree (MST) of the sentence graph; in the projective case it can be found with Eisner’s algorithm. The authors define the score of a tree as s(x, y) = sum[w . f(i,j)] over the edges (i,j) of a dependency tree y for a sentence x, where f(i,j) is a feature vector and w is the weight vector that needs to be learned. To extend the procedure to non-projective trees, they apply the greedy Chu-Liu-Edmonds algorithm: for each non-root vertex, keep only the incoming edge with the highest score; if the resulting graph contains a cycle, contract the cycle into a single vertex, rescore the edges, and repeat until the greedy graph is a tree, which is guaranteed to be the MST. Tarjan’s efficient implementation of the algorithm guarantees O(n^2), which dispels the notion that non-projective parsing is “harder” than projective parsing, which takes O(n^3).

To learn the weight vector w, the authors employ a factored version of the Margin Infused Relaxed Algorithm (MIRA), which iteratively updates w for each sample (x_t, y_t) by solving min ||w_t+1 - w_t|| s.t. s(l, j) - s(k, j) >= 1 for every correct edge (l,j) in y_t and incorrect edge (k,j) not in y_t.

To evaluate the proposed approach, the paper reports dependency parsing results for Czech on the Prague Dependency Treebank, which contains both projective and non-projective structures. The method achieves state-of-the-art performance on this dataset and promises outstanding generality across languages at low computational cost. However, the paper only vaguely defines the high-dimensional feature vectors f(i,j), which must play a crucial role in the optimization of the weight vector w. Moreover, the lower accuracy and completeness of the proposed method for English call for a closer look at the optimization procedure. Finally, as the authors mention, non-projective parsing does not generalize well to larger substructures, as the search becomes intractable.
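A minimal sketch of the greedy step of Chu-Liu-Edmonds described above, assuming a dense score matrix scores[h][d] = w . f(h, d) (function names are mine; the cycle contraction, rescoring, and Tarjan's O(n^2) bookkeeping are omitted for brevity):

```python
def greedy_heads(scores, root=0):
    """For each non-root vertex keep only the highest-scoring incoming edge.

    scores[h][d] is the score of the edge head h -> dependent d.
    Returns a dict head[d] = h for every dependent d != root.
    """
    n = len(scores)
    head = {}
    for d in range(n):
        if d == root:
            continue
        head[d] = max((h for h in range(n) if h != d), key=lambda h: scores[h][d])
    return head

def find_cycle(head):
    """Return the vertices of one cycle in the greedy graph, or None if it is a tree."""
    for start in head:
        seen, v = set(), start
        while v in head and v not in seen:
            seen.add(v)
            v = head[v]          # follow head pointers toward the root
        if v == start:           # walked back to the start: found a cycle
            cycle, u = [start], head[start]
            while u != start:
                cycle.append(u)
                u = head[u]
            return cycle
    return None
```

If find_cycle returns None, the greedy heads already form the MST; otherwise the full algorithm would contract the cycle into a single vertex and recurse.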
[link]
The paper presents the famous Word2Vec model, which became ubiquitous in numerous NLP applications partly owing to its linear feature space with additive compositionality. To be fair, the paper is an extension of previous work by Tomas Mikolov and his colleagues on distributed representations of words and phrases. The proposed approach is based on the skip-gram model and introduces four methods that significantly improve training speed and representation quality. In particular, the approach is effective for both frequent and rare words and phrases, including idiomatic phrases.

The skip-gram model uses a softmax to define P(w_O | w_I), which becomes impractical for large vocabularies because of the expensive normalization term. To tackle this, the authors employ a binary Huffman tree over the words and compute P(w_O | w_I) as a product of sigmoids along the path from the root to the word w_O. The argument of each sigmoid is the dot product between the input word vector and the inner node’s vector, multiplied by an indicator that is +1 or -1 depending on which child the path follows at that node. The tree structure reduces the cost from O(W) to O(log W) on average; additionally, since the Huffman tree assigns short codes to frequent words, training gets an extra speed-up.

As an alternative to tree-based hierarchical softmax, the authors introduce a simplified variant of Noise Contrastive Estimation (NCE), which they call Negative Sampling. The idea is to use binary logistic regression to distinguish positive examples from negative ones, rather than picking one class out of the entire vocabulary. For each positive example they sample k negative examples from a “noise distribution” Pn. As Pn the authors use the unigram distribution raised to the 3/4 power, U(w)^(3/4)/Z, which works best for both Negative Sampling and NCE, although they do not elaborate on the choice of the noise distribution.

To balance frequent and rare words, the authors aggressively subsample words whose frequency exceeds a chosen threshold while preserving the ranking of the frequencies. They demonstrate that this heuristically chosen subsampling strategy accelerates learning and even significantly improves the accuracy of the learned vectors of rare words. A word w is discarded with probability p(w) = 1 - sqrt(t/f(w)), where t is a manually selected threshold (e.g. 10^-5) and f(w) is the word frequency.

Finally, the authors learn vector representations of phrases by merging words into single tokens. They score bigrams as score(w_i, w_j) = (cnt(w_i, w_j) - delta) / (cnt(w_i) * cnt(w_j)), where delta is a discounting coefficient, and keep the bigrams whose scores exceed a chosen threshold; longer phrases are formed by applying the scoring recursively. The chosen metric combines words that appear frequently together and infrequently in other contexts.

Although the paper lacks a detailed discussion of the critical choices behind subsampling and negative sampling, it lays the foundation for an interpretable linear feature space in NLP models. The model shows state-of-the-art performance and significantly outperforms other approaches, though this is partly due to a much larger training dataset. One of the most outstanding results of the paper is the inherent additive compositionality. It comes directly from the softmax-based objective, which maximizes the product of context probabilities: this forces the word vectors to be linearly related to the inputs of the softmax and pulls together the vectors of words that appear close to each other. As a result, word vectors can be combined arithmetically, e.g. “Paris” - “France” + “Germany” ≈ “Berlin”.
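As a small illustration of the sampling machinery described above, here is a NumPy sketch of the noise distribution, the subsampling keep probability, and negative-example drawing (function names and the rng handling are my own; the paper's implementation differs):

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram distribution raised to the 3/4 power: P_n(w) proportional to U(w)^0.75."""
    probs = np.asarray(counts, dtype=float) ** power
    return probs / probs.sum()

def keep_probability(freq, t=1e-5):
    """Keep probability under the subsampling rule p_discard(w) = 1 - sqrt(t / f(w))."""
    return np.minimum(1.0, np.sqrt(t / np.asarray(freq, dtype=float)))

def sample_negatives(noise_probs, k, rng=None):
    """Draw k negative word indices for one positive (w_I, w_O) pair."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(noise_probs), size=k, p=noise_probs)
```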
[link]
Yoshua Bengio’s visionary work on probabilistic language modelling had a huge impact on the field of Natural Language Processing. Although it was published nearly 20 years ago, it is still relevant to modern NLP solutions, and subsequent works built on it to advance the state of the art, although it took more than ten years for the paper to receive significant attention in the field.

The authors address the fundamental problem that the number of trainable parameters in conventional approaches, such as N-gram models, grows exponentially with the context length and vocabulary size; this “curse of dimensionality” limits their generalization performance. For instance, modelling the joint distribution of 10 consecutive words over a vocabulary of 100,000 words involves potentially 10^50 - 1 free parameters. The proposed probabilistic model, in contrast, scales linearly with the size of the vocabulary.

It comprises two steps: 1) converting each word into a real-valued feature vector of size m, and 2) computing the joint probability of the word sequence from the feature vectors of the n-1 context words (total input size (n-1)-by-m). The first step is implemented as a trainable matrix of size |V|-by-m, while the joint probability function is implemented as two densely connected layers (tanh and softmax activations) with a skip connection from the input of the first layer to the input of the second. The first layer has a weight matrix of size (n-1)m-by-h, and the second maps the h hidden units plus the (n-1)m skip inputs to |V| outputs, which results in a total of |V|(1+nm+h) + h(1+(n-1)m) trainable variables. The output of the model is then mixed with the output of a trigram model.

As the optimization objective, the authors use the log-likelihood of the next word with a weight penalty on the dense layers (biases are excluded, as is the |V|-by-m matrix of real-valued word features). The goal of the optimization is to find the parameters that minimize the perplexity of the training dataset. In effect, the model jointly learns the distributed representations of the words and the probability function of a sequence expressed as a function of those representations.

The proposed model improved the out-of-sample perplexity by 24% on the Brown corpus and by 8% on the AP News dataset compared to state-of-the-art smoothed trigram models. The best performing architecture used h = 100 and m = 30, without a skip connection but with trigram mixing. As the primary drawback of the approach, the authors mention significant speed limitations and propose using shallow networks, time-delay and recurrent neural networks to mitigate the problem and improve performance. They also consider several promising directions for future research, such as decomposing the network into sub-networks, representing the conditional probability as a tree, introducing a priori knowledge, interpretable word embeddings, and adjusting the gradient propagation path.

As a critique, the training does not explicitly target word semantics, although the authors argue that the embedding inherently captures similarity (close words are expected to have similar feature vectors). A better approach could be, for example, a Siamese network with an objective that minimizes a distance (e.g. cosine or Euclidean) between “similar” words; this would also decrease the number of trainable variables and avoid the risky non-regularized word embedding matrix. Secondly, as the authors themselves hint, the model could significantly benefit from recurrent architectures such as RNN, LSTM, or GRU, as well as from incorporating prior knowledge. Finally, the model still scales linearly with n and |V|, which significantly limits its applicability.
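To make the architecture concrete, here is a minimal NumPy sketch of one forward pass, assuming the parameter shapes described above (the function and variable names are mine, not the paper’s):

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, W, b):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for a single context of n-1 word indices.

    C : (|V|, m)       word feature (embedding) matrix
    H : (h, (n-1)*m)   hidden-layer weights,  d : (h,) hidden bias
    U : (|V|, h)       hidden-to-output weights
    W : (|V|, (n-1)*m) direct (skip-connection) weights, b : (|V|,) output bias
    """
    x = C[context_ids].reshape(-1)        # concatenate the n-1 feature vectors
    hidden = np.tanh(d + H @ x)           # tanh layer
    logits = b + U @ hidden + W @ x       # skip connection adds W @ x directly
    logits -= logits.max()                # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum()
```

Setting W to zero recovers the variant without the skip connection, which, as noted above, performed best when mixed with the trigram model.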