# Skip-Thought Vectors
## Introduction
* The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
* It also describes a vocabulary expansion method to encode words not seen at training time.
* [Link to the paper](https://arxiv.org/abs/1506.06726)
## Skip-Thoughts
* Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
* The model is called **skip-thoughts** and the encoded vectors are called **skip-thought vectors.**
* Similar to the [skip-gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model in the sense that surrounding sentences are used to learn sentence vectors.
### Architecture
* Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence).
* **Encoder**
* RNN Encoder with GRU.
* **Decoder**
* RNN Decoder with conditional GRU.
* Conditioned on encoder output.
* Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
* **Vocabulary matrix (V)** - Weight matrix having one row (vector) for each word in the vocabulary.
* Separate decoders for the previous and next sentence which share only **V**.
* Given the decoder hidden state **h** at a time step (which depends on the encoder output and the words generated so far), the probability of choosing *w* as the next word is proportional to *exp(v_w · h)*, where *v_w* is the row of **V** corresponding to *w*.
* **Objective**
* Sum of the log-probabilities of the next and previous sentences conditioned on the encoder output (see the equations below).
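Following the paper's notation, a sketch of the conditional GRU decoder for the next sentence (conditioned on the encoder output *h_i*; *x^{t-1}* is the embedding of the previously generated word and *v_w* is the row of **V** for word *w*; the previous-sentence decoder is analogous), together with the objective:

```latex
% Conditional GRU decoder for the next sentence s_{i+1}, conditioned on encoder output h_i
\begin{aligned}
r^t &= \sigma\!\left(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i\right) && \text{(reset gate)}\\
z^t &= \sigma\!\left(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i\right) && \text{(update gate)}\\
\bar{h}^t &= \tanh\!\left(W^d x^{t-1} + U^d\!\left(r^t \odot h^{t-1}\right) + C h_i\right) \\
h_{i+1}^t &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \\
P\!\left(w_{i+1}^t \mid w_{i+1}^{<t}, h_i\right) &\propto \exp\!\left(v_{w_{i+1}^t} h_{i+1}^t\right)
\end{aligned}
% Objective: sum of log-probabilities of the next and previous sentences given h_i
\sum_t \log P\!\left(w_{i+1}^t \mid w_{i+1}^{<t}, h_i\right) + \sum_t \log P\!\left(w_{i-1}^t \mid w_{i-1}^{<t}, h_i\right)
```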
## Vocabulary Expansion
* Use a model such as word2vec that can induce word representations, trained so that it covers all the words the encoder is likely to encounter (a much larger vocabulary than the training one).
* Learn a matrix **W** such that *encoder_embedding(word) ≈ W · word2vec(word)* (a linear map, fit e.g. by least squares) over the words common to both the word2vec model and the encoder model.
* Use **W** to generate encoder-compatible embeddings for words not seen during encoder training (see the sketch below).
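A minimal sketch of the expansion step in NumPy, assuming hypothetical dictionaries `w2v` (word → word2vec vector) and `rnn_emb` (word → the encoder's learned word embedding); **W** is fit by unregularized least squares on the shared vocabulary:

```python
import numpy as np

def expand_vocabulary(w2v, rnn_emb):
    """Fit W so that rnn_emb[word] ≈ W @ w2v[word] on the shared vocabulary,
    then map every word2vec-only word into the encoder's embedding space."""
    shared = [w for w in rnn_emb if w in w2v]       # words known to both models
    X = np.stack([w2v[w] for w in shared])          # (n, d_w2v)
    Y = np.stack([rnn_emb[w] for w in shared])      # (n, d_rnn)
    W_T, *_ = np.linalg.lstsq(X, Y, rcond=None)     # solves X @ W_T ≈ Y, so W = W_T.T
    expanded = dict(rnn_emb)                        # keep the embeddings the encoder learned
    for word, vec in w2v.items():
        if word not in expanded:                    # unseen word: project through W
            expanded[word] = vec @ W_T
    return expanded
```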
## Dataset
* [BookCorpus dataset](https://arxiv.org/abs/1506.06724), which contains books from 16 genres.
## Training
* **uni-skip**
* Unidirectional encoder producing 2400-dimensional vectors.
* **bi-skip**
* Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
* **combine-skip**
* Concatenation of the uni-skip and bi-skip vectors.
* Initialization
* Recurrent matrices - orthogonal initialization.
* Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
* Mini-batches of size 128.
* Gradient clipping at norm 10.
* Adam optimizer (these settings are sketched below).
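A hedged PyTorch sketch of the settings above; the `SkipThoughts` model, its dimensions, and the `batches` iterator are placeholders rather than the paper's code:

```python
import torch
from torch import nn, optim

model = SkipThoughts(vocab_size=20000, embed_dim=620, hidden_dim=2400)  # placeholder model

# Initialization: orthogonal for recurrent matrices, uniform [-0.1, 0.1] for the rest
for name, param in model.named_parameters():
    if "weight_hh" in name:            # hidden-to-hidden (recurrent) GRU weights
        nn.init.orthogonal_(param)
    elif param.dim() >= 2:             # all other weight matrices
        nn.init.uniform_(param, -0.1, 0.1)

optimizer = optim.Adam(model.parameters())

for batch in batches:                  # mini-batches of 128 sentence tuples
    loss = model(batch)                # negative log-probability of the surrounding sentences
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip gradients at norm 10
    optimizer.step()
```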
## Experiments
* After learning skip-thoughts, freeze the model and use the encoder only as a feature extractor.
* The vectors are evaluated with linear models on the following tasks (a minimal sketch of this setup follows):
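A minimal sketch of that setup, assuming a hypothetical `encoder.encode()` that returns frozen skip-thought vectors and illustrative names for the task data:

```python
from sklearn.linear_model import LogisticRegression

# The encoder is frozen: it only maps sentences to vectors, with no fine-tuning.
train_X = encoder.encode(train_sentences)   # e.g. 4800-dim combine-skip vectors
test_X = encoder.encode(test_sentences)

clf = LogisticRegression(max_iter=1000)     # simple linear model on top of the features
clf.fit(train_X, train_labels)
print("accuracy:", clf.score(test_X, test_labels))
```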
### Semantic Relatedness
* Given a sentence pair, predict how closely related the two sentences are.
* **skip-thoughts** outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
* Adding features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.
### Paraphrase detection
* **skip-thoughts** outperforms recursive nets with dynamic pooling if no hand-crafted features are used.
* **skip-thoughts** combined with basic pairwise statistics produces results comparable to state-of-the-art systems built on elaborate, hand-engineered features (see the sketch below).
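For sentence-pair tasks the pairwise features are built from the two skip-thought vectors *u* and *v* (their component-wise product and absolute difference); a sketch, where `X1`, `X2`, `labels`, and the logistic-regression classifier are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    """Pairwise features for two skip-thought vectors: |u - v| and u * v."""
    return np.concatenate([np.abs(u - v), u * v], axis=-1)

# X1, X2: skip-thought vectors for the first/second sentence of each pair
features = np.stack([pair_features(u, v) for u, v in zip(X1, X2)])
clf = LogisticRegression(max_iter=1000).fit(features, labels)
```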
### Image-sentence Ranking
* MS COCO dataset
* Tasks
* Image annotation - Given an image, rank the sentences by how well they describe it.
* Image search - Given a caption, find the image being described.
* Though the system does not outperform the baseline in every case, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.
### Classification
* **skip-thoughts** performs about as well as bag-of-words baselines but is outperformed by methods whose sentence representations are learned for the task at hand.
* Combining **skip-thoughts** with bi-gram Naive Bayes (NB) features improves the performance.
## Future Work
* Variants to be explored include:
* Fine tuning the encoder-decoder model during the downstream task instead of freezing the weights.
* Deep encoders and decoders.
* Larger context windows.
* Encoding and decoding paragraphs.
* Alternative encoders, such as convnets.