Using the Output Embedding to Improve Language Models on ShortScience.org

Using the Output Embedding to Improve Language Models
Ofir Press and Lior Wolf
Association for Computational Linguistics - 2017 via Local CrossRef

Summary by Ofir Press 6 years ago

## __Background__
RNN language models are composed of:
1. Embedding layer
2. Recurrent layer(s) (RNN/LSTM/GRU/...)
3. Softmax layer (linear transformation + softmax operation)

The embedding matrix and the matrix of the linear transformation just before the softmax operation have the same shape (size_of_vocab × recurrent_state_size).
They both contain one representation for each word in the vocabulary.
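
To make the shapes concrete, here is a minimal pure-Python sketch (not the paper's code). The toy sizes, the helper names `make_matrix` and `softmax`, and the shortcut of using the input vector directly as the hidden state are all assumptions made for illustration; a real model would run recurrent layers in between and would usually add a bias term.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (list of lists)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
vocab_size, hidden_size = 5, 3  # toy sizes, chosen only for illustration

# Two separate matrices, each with one row (one representation) per word.
input_embedding = make_matrix(vocab_size, hidden_size, rng)
output_embedding = make_matrix(vocab_size, hidden_size, rng)

# Input side: a word id selects its row.
word_id = 2
x = input_embedding[word_id]

# Stand-in for the recurrent layer(s): the hidden state is just x here.
h = x

# Output side: each word's score is the dot product of its row with h.
logits = [sum(w * hi for w, hi in zip(row, h)) for row in output_embedding]
probs = softmax(logits)

shape_in = (len(input_embedding), len(input_embedding[0]))
shape_out = (len(output_embedding), len(output_embedding[0]))
print(shape_in, shape_out)
```

The input matrix is used by row lookup and the output matrix by a matrix-vector product, but both store exactly one vector of length `hidden_size` per vocabulary word.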

## __Weight Tying__
This paper shows that using the same matrix as both the input embedding and the pre-softmax linear transformation (the output embedding) improves the performance of a wide variety of language models while substantially reducing the number of parameters, since one of the two vocabulary-sized matrices is eliminated.
In weight-tied models, each word has just one representation, which is used as both the input and the output embedding.
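
The idea can be sketched in a few lines of pure Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the class name `TinyTiedLM`, the helper `embedding_param_count`, the omission of the recurrent layers and of any bias, and the example sizes (a 10,000-word vocabulary with 650 hidden units) are all hypothetical choices.

```python
import math
import random

class TinyTiedLM:
    """Toy language-model step with ONE matrix shared by input and output."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # The only embedding matrix: one row per word.
        self.embedding = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(vocab_size)
        ]

    def embed(self, word_id):
        """Input embedding: look up the word's row."""
        return self.embedding[word_id]

    def next_word_probs(self, hidden):
        """Output embedding: score every word with the SAME rows, then softmax."""
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.embedding]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

def embedding_param_count(vocab_size, hidden_size, tied):
    """Parameters in the embedding matrices only (no recurrent layers, no bias)."""
    per_matrix = vocab_size * hidden_size
    return per_matrix if tied else 2 * per_matrix

model = TinyTiedLM(vocab_size=5, hidden_size=3)
hidden = model.embed(2)  # stand-in for the recurrent layers
probs = model.next_word_probs(hidden)

untied_count = embedding_param_count(10_000, 650, tied=False)
tied_count = embedding_param_count(10_000, 650, tied=True)
print(untied_count, tied_count)
```

With these assumed sizes, tying halves the embedding parameters from 13,000,000 to 6,500,000. Because both uses read the same matrix, any gradient update from either the input or the output side changes the single shared representation.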

## __Why does weight tying work?__
1. In the paper we show that in un-tied language models, the output embedding contains much better word representations than the input embedding. We also show that when the embedding matrices are tied, the quality of the shared embedding is comparable to that of the output embedding in the un-tied model. So the single embedding of the tied model is as good as the un-tied output embedding and better than the un-tied input embedding.
2. In most language modeling tasks, the datasets are small, so the models tend to overfit. When the number of parameters is reduced in a way that makes sense, the capacity of the network drops and there is less overfitting.

## __Can I tie the input and output embeddings of the decoder of a translation model?__
Yes, we show that this reduces the model's size while not hurting its performance.
In addition, we show that if you preprocess your data using BPE, the subword vocabularies of the source and target languages overlap heavily, so __Three-Way Weight Tying__ can be used. In Three-Way Weight Tying, we tie the input embedding in the encoder to the input and output embeddings of the decoder (so each subword has one representation, which is used across three matrices).
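
The three options can be compared with a small pure-Python sketch. It is a hedged illustration only: the function names, the scheme labels `"untied"`, `"decoder_tied"` and `"three_way"`, the zero initialization, and the toy sizes are assumptions, and the sketch presumes a single subword vocabulary shared by both languages.

```python
def build_embeddings(vocab_size, dim, scheme):
    """Return (encoder_in, decoder_in, decoder_out) for a toy translation model.

    scheme is a hypothetical label: "untied", "decoder_tied", or "three_way".
    """
    def new_matrix():
        # Zeros are enough here, since only sharing and sizes are inspected.
        return [[0.0] * dim for _ in range(vocab_size)]

    if scheme == "three_way":
        shared = new_matrix()
        return shared, shared, shared      # one matrix in all three roles
    if scheme == "decoder_tied":
        decoder = new_matrix()
        return new_matrix(), decoder, decoder
    if scheme == "untied":
        return new_matrix(), new_matrix(), new_matrix()
    raise ValueError(f"unknown scheme: {scheme}")

def count_embedding_params(matrices):
    """Count parameters, counting each distinct matrix object only once."""
    unique = {id(m): m for m in matrices}
    return sum(len(m) * len(m[0]) for m in unique.values())

vocab_size, dim = 8, 4  # toy sizes, chosen only for illustration
counts = {
    scheme: count_embedding_params(build_embeddings(vocab_size, dim, scheme))
    for scheme in ("untied", "decoder_tied", "three_way")
}
enc_in, dec_in, dec_out = build_embeddings(vocab_size, dim, "three_way")
print(counts)
```

In this toy setting the embedding parameter count drops from three matrices to two with decoder tying, and to one with three-way tying.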

[This](http://ofir.io/Neural-Language-Modeling-From-Scratch/) blog post contains more details about the weight tying method.
