# Improving Word Representations via Global Context and Multiple Word Prototypes
## Introduction
* This paper pre-dates models like GloVe and Word2Vec and proposes an architecture that
* combines local and global context while learning word embeddings, to better capture word semantics.
* learns multiple embeddings per word to account for homonymy and polysemy.
* [Link to the paper](http://www.aclweb.org/anthology/P12-1092)
## Global Context-Aware Neural Language Model
### Training Objective
* Given a word sequence *s* (local context) and a document *d* in which the sequence occurs (global context), learn word representations while learning to discriminate the last correct word in *s* from other words.
* *g(s, d)* - scoring function giving the likelihood of the correct sequence.
* *g(s<sup>w</sup>, d)* - scoring function giving the likelihood of *s* with its last word replaced by a word *w*.
* Objective - *g(s, d)* > *g(s<sup>w</sup>, d)* + 1 for any other word *w*.
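A minimal Python sketch of this margin-based ranking objective (the function name and example scores are illustrative, not taken from the paper):

```python
def ranking_loss(score_correct: float, score_corrupt: float) -> float:
    """Hinge loss enforcing g(s, d) > g(s_w, d) + 1.

    Zero when the correct sequence outscores the corrupted one (last word
    replaced by another word w) by at least a margin of 1.
    """
    return max(0.0, 1.0 - score_correct + score_corrupt)

# Example: correct sequence scores 2.3, corrupted one scores 1.9;
# the margin of 1 is violated by 0.6, so the loss is 0.6.
print(ranking_loss(2.3, 1.9))
```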
### Architecture
* Two scoring components (neural networks) to capture:
* Local Context
* Map word sequence *s* into an ordered list of vectors *x = [x<sub>1</sub>, ..., x<sub>m</sub>]*.
* *x<sub>i</sub>* - embedding corresponding to *i<sup>th</sup>* word in the sequence.
* Compute local score *score<sub>l</sub>* by using a neural network (with one hidden layer) over *x*.
* Preserves word order and syntactic information.
* Global Context
* Map document *d* to an ordered list of word embeddings, *d = (d<sub>1</sub>, ..., d<sub>k</sub>)*.
* Compute *c*, the weighted average of all word vectors in document.
* The paper uses the *idf* score of each word to weight its vector in the average.
* *x* = concatenation of *c* and the embedding of the last word in *s*.
* Compute global score *score<sub>g</sub>* by using a neural network (with two hidden layers) over *x*.
* Similar to bag-of-words features.
* *score = score<sub>l</sub> + score<sub>g</sub>*
* The weights of the hidden layers and the word embeddings are trained jointly (a sketch of both scoring components follows this list).
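A rough NumPy sketch of the two scoring components; the weight matrices, dimensions, and random toy inputs are hypothetical stand-ins for the trainable parameters and data described above:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, m = 50, 100, 10   # embedding size, hidden units, window length (illustrative)

def local_score(x_seq, W1, b1, w2):
    """One-hidden-layer net over the concatenated window embeddings (score_l)."""
    h = np.tanh(W1 @ np.concatenate(x_seq) + b1)
    return float(w2 @ h)

def global_score(doc_vecs, idf, x_last, W1, b1, W2, b2, w3):
    """Two-hidden-layer net over [idf-weighted document average; last word] (score_g)."""
    c = (idf[:, None] * doc_vecs).sum(axis=0) / idf.sum()
    h1 = np.tanh(W1 @ np.concatenate([c, x_last]) + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return float(w3 @ h2)

# Toy example with random parameters; in training these would be learned
# jointly with the word embeddings.
x_seq = [rng.normal(size=dim) for _ in range(m)]   # local window embeddings
doc_vecs = rng.normal(size=(200, dim))             # embeddings of document words
idf = rng.uniform(0.5, 2.0, size=200)              # idf weight per document word

score_l = local_score(x_seq,
                      rng.normal(size=(hidden, m * dim)), rng.normal(size=hidden),
                      rng.normal(size=hidden))
score_g = global_score(doc_vecs, idf, x_seq[-1],
                       rng.normal(size=(hidden, 2 * dim)), rng.normal(size=hidden),
                       rng.normal(size=(hidden, hidden)), rng.normal(size=hidden),
                       rng.normal(size=hidden))
score = score_l + score_g
```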
### Multi-Prototype Neural Language Model
* Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word.
* Solution - train multiple vectors per word to capture the different meanings.
* Approach
* Gather all the fixed-sized context windows for all occurrences of a given word.
* Compute a context vector for each occurrence by taking the weighted average of the word vectors in the context window.
* Cluster the context vectors using spherical k-means.
* Each word occurrence in the corpus is re-labeled to its associated cluster.
* To find similarity between a pair of words *(w, w')*:
* For each pair of clusters *(i, j)*, where *i* is a prototype of *w* and *j* is a prototype of *w'*, compute the distance between the cluster centers and weight it by the product of the probability that *w* belongs to *i* and the probability that *w'* belongs to *j*, given their respective contexts.
* Average this weighted value over the *k<sup>2</sup>* cluster pairs (a sketch follows this list).
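A hedged sketch of the multi-prototype step: spherical k-means is approximated here by L2-normalising the context vectors and running scikit-learn's standard `KMeans`, and the similarity function follows the description above (cosine similarity between cluster centers is an assumption) rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_contexts(context_vectors, k=10, seed=0):
    """Cluster the context vectors of one word's occurrences.

    Returns the cluster centers (prototypes) and, for each occurrence,
    the cluster id it is relabelled with.
    """
    X = np.asarray(context_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to the unit sphere
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_, km.labels_

def context_similarity(centers_w, probs_w, centers_v, probs_v):
    """Similarity of words w and w' given their contexts.

    Each prototype pair (i, j) contributes the similarity of its cluster
    centers, weighted by p(w in i | context) * p(w' in j | context); the
    result is averaged over the k^2 pairs.
    """
    total = 0.0
    for i, ci in enumerate(centers_w):
        for j, cj in enumerate(centers_v):
            cos = ci @ cj / (np.linalg.norm(ci) * np.linalg.norm(cj))
            total += probs_w[i] * probs_v[j] * cos
    return total / (len(centers_w) * len(centers_v))
```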
## Training
* Dataset
* Wikipedia corpus
* Parameters
* 10-word windows
* 100 hidden units
* No weight regularization
* 10 different word embeddings learnt for words having multiple meanings.
## Evaluation
* Dataset
* WordSim-353
* 353 pairs of nouns
* word pairs are presented without any context
* contains human similarity judgements for each pair of words
* The paper contributed a new dataset
* captures human similarity judgements for pairs of words in the context of a sentence
* consists of verbs and adjectives along with nouns
* for details on how the dataset is constructed, refer to the paper
* Performance
* The proposed model achieves higher correlation with human scores than models using only the local or the global context.
* Performance improves further when stop words are removed.
* The multi-prototype approach (multiple vectors per word) benefits the model on tasks where the context is also given.
## Comments
* This work predated the more general word embedding models like [Word2Vec](https://gist.github.com/shagunsodhani/176a283e2c158a75a0a6) and [GloVe](https://gist.github.com/shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8). While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.