# Improving Word Representations via Global Context and Multiple Word Prototypes
## Introduction
* This paper pre-dates models like GloVe and Word2Vec and proposes an architecture that
* combines local and global context while learning word embeddings, to better capture word semantics.
* learns multiple embeddings per word to account for homonymy and polysemy.
* [Link to the paper](http://www.aclweb.org/anthology/P12-1092)
## Global Context-Aware Neural Language Model
### Training Objective
* Given a word sequence *s* (local context) and a document *d* in which the sequence occurs (global context), learn word representations while learning to discriminate the last correct word in *s* from other words.
* *g(s, d)* - scoring function giving the likelihood of the correct sequence.
* *g(s<sup>w</sup>, d)* - scoring function giving the likelihood of *s* with its last word replaced by a word *w*.
* Objective - *g(s, d)* > *g(s<sup>w</sup>, d)* + 1 for any other word *w*.
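A minimal Python sketch of this margin-based ranking objective (the function name and example scores are illustrative, not taken from the paper):

```python
def ranking_loss(score_correct: float, score_corrupt: float) -> float:
    """Hinge loss enforcing g(s, d) > g(s_w, d) + 1.

    Zero when the correct sequence outscores the corrupted one (last word
    replaced by another word w) by at least a margin of 1.
    """
    return max(0.0, 1.0 - score_correct + score_corrupt)

# Example: correct sequence scores 2.3, corrupted one scores 1.9;
# the margin of 1 is violated by 0.6, so the loss is 0.6.
print(ranking_loss(2.3, 1.9))
```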
### Architecture
* Two scoring components (neural networks) to capture:
* Local Context
* Map word sequence *s* into an ordered list of vectors *x = [x<sub>1</sub>, ..., x<sub>m</sub>]*.
* *x<sub>i</sub>* - embedding corresponding to *i<sup>th</sup>* word in the sequence.
* Compute local score *score<sub>l</sub>* by using a neural network (with one hidden layer) over *x*.
* Preserves word order and syntactic information.
* Global Context
* Map document *d* to an ordered list of word embeddings, *d = (d<sub>1</sub>, ..., d<sub>k</sub>)*.
* Compute *c*, the weighted average of all word vectors in document.
* The paper uses the *idf* score of each word to weight its vector in the average.
* *x* = concatenation of *c* and the embedding of the last word in *s*.
* Compute global score *score<sub>g</sub>* by using a neural network (with two hidden layers) over *x*.
* Similar to bag-of-words features.
* *score = score<sub>l</sub> + score<sub>g</sub>*
* The weights of the hidden layers and the word embeddings are trained jointly (a sketch of both scoring components follows this list).
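A rough NumPy sketch of the two scoring components; the weight matrices, dimensions, and random toy inputs are hypothetical stand-ins for the trainable parameters and data described above:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, m = 50, 100, 10   # embedding size, hidden units, window length (illustrative)

def local_score(x_seq, W1, b1, w2):
    """One-hidden-layer net over the concatenated window embeddings (score_l)."""
    h = np.tanh(W1 @ np.concatenate(x_seq) + b1)
    return float(w2 @ h)

def global_score(doc_vecs, idf, x_last, W1, b1, W2, b2, w3):
    """Two-hidden-layer net over [idf-weighted document average; last word] (score_g)."""
    c = (idf[:, None] * doc_vecs).sum(axis=0) / idf.sum()
    h1 = np.tanh(W1 @ np.concatenate([c, x_last]) + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return float(w3 @ h2)

# Toy example with random parameters; in training these would be learned
# jointly with the word embeddings.
x_seq = [rng.normal(size=dim) for _ in range(m)]   # local window embeddings
doc_vecs = rng.normal(size=(200, dim))             # embeddings of document words
idf = rng.uniform(0.5, 2.0, size=200)              # idf weight per document word

score_l = local_score(x_seq,
                      rng.normal(size=(hidden, m * dim)), rng.normal(size=hidden),
                      rng.normal(size=hidden))
score_g = global_score(doc_vecs, idf, x_seq[-1],
                       rng.normal(size=(hidden, 2 * dim)), rng.normal(size=hidden),
                       rng.normal(size=(hidden, hidden)), rng.normal(size=hidden),
                       rng.normal(size=hidden))
score = score_l + score_g
```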
### Multi-Prototype Neural Language Model
* Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word.
* Solution - train multiple vectors per word to capture the different meanings.
* Approach
* Gather all the fixed-sized context windows for all occurrences of a given word.
* Compute a context vector for each occurrence by taking the weighted average of the word vectors in the context window.
* Cluster the context vectors using spherical k-means.
* Each word occurrence in the corpus is re-labeled to its associated cluster.
* To find similarity between a pair of words *(w, w')*:
* For each pair of clusters *(i, j)*, where *i* is a prototype of *w* and *j* is a prototype of *w'*, compute the distance between the cluster centers and weight it by the product of the probability that *w* belongs to *i* and the probability that *w'* belongs to *j*, given their respective contexts.
* Average this weighted value over the *k<sup>2</sup>* cluster pairs (a sketch follows this list).
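A hedged sketch of the multi-prototype step: spherical k-means is approximated here by L2-normalising the context vectors and running scikit-learn's standard `KMeans`, and the similarity function follows the description above (cosine similarity between cluster centers is an assumption) rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_contexts(context_vectors, k=10, seed=0):
    """Cluster the context vectors of one word's occurrences.

    Returns the cluster centers (prototypes) and, for each occurrence,
    the cluster id it is relabelled with.
    """
    X = np.asarray(context_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to the unit sphere
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_, km.labels_

def context_similarity(centers_w, probs_w, centers_v, probs_v):
    """Similarity of words w and w' given their contexts.

    Each prototype pair (i, j) contributes the similarity of its cluster
    centers, weighted by p(w in i | context) * p(w' in j | context); the
    result is averaged over the k^2 pairs.
    """
    total = 0.0
    for i, ci in enumerate(centers_w):
        for j, cj in enumerate(centers_v):
            cos = ci @ cj / (np.linalg.norm(ci) * np.linalg.norm(cj))
            total += probs_w[i] * probs_v[j] * cos
    return total / (len(centers_w) * len(centers_v))
```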
## Training
* Dataset
* Wikipedia corpus
* Parameters
* 10-word windows
* 100 hidden units
* No weight regularization
* 10 different word embeddings learnt for words having multiple meanings.
## Evaluation
* Dataset
* WordSim-353
* 353 pairs of nouns
* word pairs are presented without any context
* contains human similarity judgements for each pair of words
* The paper contributed a new dataset
* captures human similarity judgements for pairs of words in the context of a sentence
* consists of verbs and adjectives along with nouns
* for details on how the dataset is constructed, refer to the paper
* Performance
* The proposed model achieves higher correlation with human scores than models using only the local or the global context.
* Performance improves further when stop words are removed.
* The multi-prototype approach (multiple vectors per word) benefits the model on tasks where the context is also given.
## Comments
* This work predated the more general word embedding models like [Word2Vec](https://gist.github.com/shagunsodhani/176a283e2c158a75a0a6) and [GloVe](https://gist.github.com/shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8). While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.