Alternative structures for character-level RNNs
Piotr Bojanowski, Armand Joulin, Tomas Mikolov
arXiv e-Print archive, 2015
TLDR; The authors propose two architectures to improve the performance of character-level RNNs. In the first architecture ("mixed") they condition the character model on the state of a word-level RNN. In the second architecture ("cond") they condition the output classifier on character n-grams. The authors show that the proposed architectures outperform plain character-level RNNs in terms of entropy in bits per character.
#### Key Points
- Plain character-level RNNs need a huge hidden representation in order to model long-term dependencies, while word-level RNNs can't generalize to new vocabulary and may require a huge output vocabulary.
- Model 1 ("mixed"): Jointly train a word-level and a character-level RNN, condition the character model on the word-level state, and interpolate the losses of the two models (see the sketch after this list).
- Model 2 ("cond"): Condition the softmax on the character n-gram preceding the prediction, "relieving" the network of memorizing part of the sequence (also sketched below).
- Training: Constant learning rate, reduced after every epoch in which validation accuracy decreases.
- The n-gram conditioning can be applied to arbitrary data, not just characters; the authors also evaluate on binary data.
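Below is a minimal PyTorch sketch of how I read the two variants. Everything here is my own reconstruction: the class names, the `word_of_char` alignment trick, the hashed n-gram buckets, and the `alpha` interpolation weight are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only; class names, shapes, and the alpha weight are my assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedCharWordLM(nn.Module):
    """'Mixed' variant: a char-level RNN whose input is augmented with the state of a
    jointly trained word-level RNN; the two language-modeling losses are interpolated."""

    def __init__(self, n_chars, n_words, d_char=64, d_word=128, d_hid=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.word_emb = nn.Embedding(n_words, d_word)
        self.word_rnn = nn.RNN(d_word, d_hid, batch_first=True)
        # The char RNN sees its own embedding plus the word-RNN state of the current word.
        self.char_rnn = nn.RNN(d_char + d_hid, d_hid, batch_first=True)
        self.char_out = nn.Linear(d_hid, n_chars)
        self.word_out = nn.Linear(d_hid, n_words)

    def forward(self, chars, words, word_of_char, alpha=0.5):
        # chars: (B, Tc) char ids; words: (B, Tw) word ids
        # word_of_char: (B, Tc) index of the word each character position belongs to
        wh, _ = self.word_rnn(self.word_emb(words))                    # (B, Tw, d_hid)
        ctx = torch.gather(wh, 1,
                           word_of_char.unsqueeze(-1).expand(-1, -1, wh.size(-1)))
        ch, _ = self.char_rnn(torch.cat([self.char_emb(chars), ctx], dim=-1))
        char_loss = F.cross_entropy(self.char_out(ch[:, :-1]).transpose(1, 2), chars[:, 1:])
        word_loss = F.cross_entropy(self.word_out(wh[:, :-1]).transpose(1, 2), words[:, 1:])
        return alpha * char_loss + (1 - alpha) * word_loss             # interpolated losses


class NgramConditionedLM(nn.Module):
    """'Cond' variant: the output classifier is conditioned on (an embedding of) the
    character n-gram immediately preceding the prediction."""

    def __init__(self, n_chars, n_ngram_buckets, d_char=64, d_hid=256, d_ng=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.rnn = nn.RNN(d_char, d_hid, batch_first=True)
        self.ngram_emb = nn.Embedding(n_ngram_buckets, d_ng)   # hashed n-gram ids
        self.out = nn.Linear(d_hid + d_ng, n_chars)

    def forward(self, chars, ngram_ids):
        # ngram_ids[b, t]: bucket id of the n-gram ending at position t
        h, _ = self.rnn(self.char_emb(chars))
        feats = torch.cat([h[:, :-1], self.ngram_emb(ngram_ids[:, :-1])], dim=-1)
        return F.cross_entropy(self.out(feats).transpose(1, 2), chars[:, 1:])
```

The `gather` call simply holds the word-level state fixed across all characters of the current word, and the n-gram buckets are hashed here only to keep the embedding table bounded; the paper's exact conditioning details may differ.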
#### Notes / Questions
- In the comparison table the authors don't show the number of parameters for each model. They compare models with the same number of hidden units, but their proposed architectures need extra parameters and computation. Is this an unfair comparison?
- People typically use LSTMs/GRUs for language modeling. Of course the proposed techniques can be applied to LSTM/GRU networks, but the experimental results may look very different. Do these architectures still provide a benefit when the character model uses LSTM/GRU units?
- Entropy in bits per character seems like a somewhat strange evaluation metric. I don't really know what to make of it, and no intuitive explanation is given (the definition is recalled after this list).
- One argument the authors make in the paper is that character-level models can be applied to arbitrary input data (different languages, binary data, code, etc.). But their mixed model is clearly very language-specific: it can't be applied to arbitrary data, and many languages don't have clear word boundaries. Similarly, n-grams may be prohibitively expensive depending on what kind of data we're working with.
- The n-gram-conditioned model isn't clearly explained. I *think* I understand what it does, but I'm not quite sure. No intuitive explanation of what any of the models are learning is given.
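For the record, bits per character is just the model's average negative log-likelihood in base 2, i.e. the cross-entropy of the test characters under the model:

$$\mathrm{bpc} = -\frac{1}{N}\sum_{t=1}^{N}\log_2 p\left(c_t \mid c_1,\dots,c_{t-1}\right)$$

so $2^{\mathrm{bpc}}$ is the per-character perplexity, and lower is better.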
This paper introduces two model extensions to improve character-level recurrent neural network language models. The authors evaluate their approaches on a multilingual language modeling benchmark along with the standard Penn Treebank corpus. The evaluation uses only entropy rather than including the language model in a downstream task, but that's okay for a paper of this scope. The paper is clearly written and definitely a sufficient contribution for the workshop track. It would be really nice to see how well these methods can improve a more sophisticated recurrent architecture such as GRU or LSTM units. On the PTB corpus it would be nice to include a state-of-the-art or standard n-gram model as a reference point for the reported results.
The conditioning-on-words model is an interesting approach. It's unfortunate that such a small word-level vocabulary is used with this model. The small-vocabulary restriction seems to stem from the fact that the word-level model is jointly trained along with the character model. An alternative approach might be to use the hidden representations of an already-trained word-level recurrent model as input features when building the character-level language model (a rough sketch of that variant follows). I don't have a good sense of how much joint training of both models matters.
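As a rough illustration of that alternative, reusing the hypothetical `MixedCharWordLM` class from the sketch above, with dummy tensors standing in for real data:

```python
# Sketch of the "fixed word features" alternative: freeze the word-level parts of the
# MixedCharWordLM defined above and only optimize the character-level parameters.
import torch

model = MixedCharWordLM(n_chars=100, n_words=5000)
# Pretend the word-level embedding/RNN were pretrained elsewhere, then freeze them.
for module in (model.word_emb, model.word_rnn, model.word_out):
    for p in module.parameters():
        p.requires_grad = False

B, Tw, Tc = 2, 6, 30                                   # dummy batch just to run the snippet
words = torch.randint(0, 5000, (B, Tw))
chars = torch.randint(0, 100, (B, Tc))
word_of_char = torch.randint(0, Tw, (B, Tc))           # would be a real char-to-word alignment

# alpha=1.0: only the character loss is optimized; the word RNN acts as a fixed feature extractor.
loss = model(chars, words, word_of_char, alpha=1.0)
loss.backward()
```

Comparing this feature-based setup against the jointly trained version would directly answer how much the joint training actually matters.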