A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification on ShortScience.org


Zhang, Ye and Wallace, Byron
arXiv e-Print archive - 2015 via Local Bibsonomy


Summary by Denny Britz 9 years ago

TLDR; The authors evaluate the impact of hyperparameters (embeddings, filter region size, number of feature maps, activation function, pooling, dropout and l2 norm constraint) on Kim's (2014) CNN for sentence classification. They report empirical findings, including variance numbers, from a large number of experiments on 7 classification data sets, and give practical recommendations for architecture decisions.

#### Key Points

- Recommended baseline configuration: word2vec embeddings, filter region sizes (3,4,5), 100 feature maps per region size, ReLU activation, 1-max pooling, dropout of 0.5, and an l2 norm constraint of 3 on the weight vector.
- One-hot vectors perform worse than pre-trained embeddings. word2vec outperforms GloVe most of the time.
- The best filter region size depends on the data set and lies in the range 2-25. The recommendation is to do a line search over a single region size first and then combine multiple sizes.
- Increasing the number of feature maps per filter region to more than 600 doesn't seem to help much.
- ReLU is almost always the best activation function.
- Max-pooling is almost always the best pooling strategy.
- Dropout from 0.1 to 0.5 helps; the l2 norm constraint does not help much.
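
The baseline and the line-search recommendation above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: `CNNConfig`, `line_search_region_size`, `neighbouring_sizes` and the `evaluate` callback are hypothetical names, and `evaluate` is assumed to train a model for a given configuration and return a validation score (for example, mean accuracy over cross-validation folds). Combining the sizes adjacent to the best single size is one possible way to "combine multiple sizes", chosen here as an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class CNNConfig:
    """Baseline hyperparameters as listed in the key points (hypothetical container)."""
    embedding: str = "word2vec"
    filter_region_sizes: Tuple[int, ...] = (3, 4, 5)
    feature_maps_per_size: int = 100
    activation: str = "relu"
    pooling: str = "1-max"
    dropout_rate: float = 0.5
    l2_norm_constraint: float = 3.0


def line_search_region_size(
    evaluate: Callable[[CNNConfig], float],
    candidates: Iterable[int],
    base: CNNConfig = CNNConfig(),
) -> Tuple[int, float]:
    """Try each single region size and return (best_size, best_score).

    `evaluate` is an assumed user-supplied function: it trains a model
    with the given config and returns a score where higher is better.
    """
    best_size, best_score = None, float("-inf")
    for size in candidates:
        # Keep every other hyperparameter at its baseline value.
        score = evaluate(replace(base, filter_region_sizes=(size,)))
        if score > best_score:
            best_size, best_score = size, score
    if best_size is None:
        raise ValueError("candidates must not be empty")
    return best_size, best_score


def neighbouring_sizes(best_size: int, width: int = 1) -> Tuple[int, ...]:
    """Region sizes around the best single size, e.g. 7 -> (6, 7, 8)."""
    low, high = best_size - width, best_size + width
    return tuple(s for s in range(low, high + 1) if s >= 1)
```

A typical use would be `best, _ = line_search_region_size(evaluate, range(2, 26))`, followed by a second run with `filter_region_sizes=neighbouring_sizes(best)` to check whether combining sizes improves on the single best size.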

#### Notes/Questions

- All datasets analyzed in this paper are rather similar. They have similar average and maximum sentence lengths, and even the number of examples is of roughly the same magnitude. It would be interesting to see how the results change with very different datasets, such as long documents or very large numbers of training examples.


Summary by Marek Rei 7 years ago

The authors perform a hyperparameter search for a single-layer CNN on 9 different sentence classification datasets.
They find that the optimal embedding initialisation, filter size and number of feature maps depend on the dataset and should be chosen through a search; ReLU and tanh are the best activation functions; 1-max pooling is the best pooling method; dropout may help when the number of feature maps gets large.
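
To make the terms concrete, here is a minimal pure-Python sketch of what one filter does in such a single-layer CNN: it slides over windows of consecutive word vectors, applies ReLU, and 1-max pooling keeps only the largest activation. The function names and the toy numbers are made up for illustration; a real model would use many filters per region size and a tensor library.

```python
from typing import List, Sequence


def relu(x: float) -> float:
    """Rectified linear unit: negative values are clipped to zero."""
    return x if x > 0.0 else 0.0


def feature_map(
    sentence: Sequence[Sequence[float]],
    filt: Sequence[Sequence[float]],
    bias: float = 0.0,
) -> List[float]:
    """Slide one filter over the sentence matrix (one row per word).

    The filter spans len(filt) consecutive words (its region size) and the
    full embedding dimension, so the output has
    len(sentence) - len(filt) + 1 entries.
    """
    h = len(filt)
    if len(sentence) < h:
        raise ValueError("sentence is shorter than the filter region size")
    outputs = []
    for start in range(len(sentence) - h + 1):
        total = bias
        for word_vec, filt_row in zip(sentence[start:start + h], filt):
            total += sum(w * f for w, f in zip(word_vec, filt_row))
        outputs.append(relu(total))
    return outputs


def one_max_pool(fmap: Sequence[float]) -> float:
    """1-max pooling: keep only the largest activation of a feature map."""
    return max(fmap)


# Toy sentence: 4 words with 2-dimensional embeddings (made-up values).
sentence = [[1, 0], [0, 1], [-1, 0], [2, 2]]
# One filter with region size 2 (made-up weights).
filt = [[1, 0], [0, 1]]

fmap = feature_map(sentence, filt)
print(fmap)                # [2.0, 0.0, 1.0]
print(one_max_pool(fmap))  # 2.0
```

Because 1-max pooling reduces each feature map to a single number, the size of the pooled feature vector depends only on the number of filters, not on the sentence length.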

https://i.imgur.com/uUXVwb5.png
