[link]
# Improving Word Representations via Global Context and Multiple Word Prototypes ## Introduction * This paper pre-dated papers like Glove and Word2Vec and proposed an architecture that * combines local and global context while learning word embeddings to capture the word semantics. * learns multiple embeddings per word to account for homonymy and polysemy. * [Link to the paper](http://www.aclweb.org/anthology/P12-1092) ## Global Context-Aware Neural Language Model ### Training Objective * Given a word sequence *s* (local context) and a document *d* in which the sequence occurs (global context), learn word representations while learning to discriminate the last correct word in *s* from other words. * *g(s, d)* - scoring function giving liklihood of correct sequence. * *g(s<sup>w</sup>, d)* - scoring function giving liklihood of last word in *s* repalced by a word *w*. * Objective - *g(s, d)* > *g(s<sup>w</sup>, d)* + 1 for any other word *w*. ### Architecture * Two scoring components (neural networks) to capture: * Local Context * Map word sequence *s* into an ordered list of vectors *x = [x<sub>1</sub>, ..., x<sub>m</sub>]*. * *x<sub>i</sub>* - embedding corresponding to *i<sup>th</sup>* word in the sequence. * Compute local score *score<sub>l</sub>* by using a neural network (with one hidden layer) over *x*. * Preserves word order and syntactic information. * Global Context * Map document *d* to an ordered list of word embeddings, *d = (d<sub>1</sub>, ..., d<sub>k</sub>)*. * Compute *c*, the weighted average of all word vectors in document. * The paper uses *idf* score for weighting the documents. * *x = * concatenation of *c* and vector of the last word in *s*. * Compute global score *score<sub>g</sub>* by using a neural network (with two hidden layers) over *x*. * Similar to bag-of-words features. *score = score<sub>l</sub> + score<sub>g</sub>* * Train the weights of the hidden layers and the word embeddings. ### Multi-Prototype Neural Language Model * Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word. * Solution - train multiple vectors per word to capture the different meanings. * Approach * Gather all the fixed-sized context windows for all occurrences of a given word. * Find the context vector by performing weighted averaging of all the words in the context window. * Cluster the context vectors using spherical k-means. * Each word occurrence in the corpus is re-labeled to its associated cluster. * To find similarity between a pair of words *(w, w')*: * For each possible cluster of *i* and *j* corresponding to the words *w* and *w'*, find distance between cluster centers for *i* and *j* and weight them by the product of probabilities of *w* belonging to *i* and *w'* belonging to *j* given their respective contexts. * Average the value over the *k<sup>2</sup>* pairs. ## Training * Dataset * Wikipedia corpus * Parameters * 10-word windows * 100 hidden units * No weight regularization * 10 different word embeddings learnt for words having multiple meanings. ## Evaluation * Dataset * WordSim-353 * 353 pairs of nouns * words represented without context * contains human similarity judgements on pair of words * The paper contributed a new dataset * captures human similarity judgements on pair of words in the context of a sentence * consists of verbs and adjectives along with nouns * for details on how the dataset is constructed, refer the paper * Performance * Proposed model achieves higher correlation to human scores than models using only the local or global context. * Performance can be improved by removing the stop words. * Using multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given. ## Comments * This work predated the more general word embedding models like [Word2Vec](https://gist.github.com/shagunsodhani/176a283e2c158a75a0a6) and [Glove](https://gist.github.com/shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8). While this model performs good at intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER. |