#### Introduction
* The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks.
* [Link to the paper](https://arxiv.org/abs/1408.5882)
* [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier)
#### Architecture
* Pad input sentences so that they are of the same length.
* Map each word in the padded sentence to its word embedding (initialized either randomly or with pre-trained word2vec vectors) to obtain a matrix representing the sentence.
* Apply a convolution layer with multiple filter widths, producing several feature maps.
* Apply a max-over-time pooling operation over each feature map.
* Concatenate the pooled features from the different filters and feed them to a fully-connected layer with softmax activation.
* The softmax outputs a probability distribution over the labels.
* Use dropout for regularisation (a minimal code sketch follows this list).
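A minimal Keras sketch of this pipeline; the vocabulary size, embedding dimension, sentence length, and number of classes are illustrative assumptions, not values from the paper:

```python
import tensorflow as tf

# Illustrative sizes only.
vocab_size, embed_dim, max_len, num_classes = 20000, 300, 60, 2

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")

# Embedding lookup: each padded sentence becomes a (max_len, embed_dim) matrix.
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)

# One convolution + max-over-time pooling branch per filter width.
pooled = []
for width in (3, 4, 5):
    conv = tf.keras.layers.Conv1D(filters=100, kernel_size=width, activation="relu")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

# Concatenate the pooled features, regularise with dropout, classify with softmax.
features = tf.keras.layers.Concatenate()(pooled)
features = tf.keras.layers.Dropout(0.5)(features)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
```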
#### Hyperparameters
* ReLU activation for convolution layers
* Filter windows of 3, 4, and 5, with 100 feature maps each.
* Dropout - 0.5
* L2-norm constraint on weight vectors - 3
* Batch size - 50
* Adadelta update rule (see the sketch after this list).
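Continuing the sketch above, these settings might be wired in roughly as follows (training data such as `x_train`/`y_train` is assumed, not provided):

```python
# Re-create the softmax layer with the L2-norm constraint of 3 on its weights.
outputs = tf.keras.layers.Dense(
    num_classes,
    activation="softmax",
    kernel_constraint=tf.keras.constraints.MaxNorm(3),
)(features)
model = tf.keras.Model(inputs, outputs)

# Adadelta update rule; dropout of 0.5 is already inside the model.
model.compile(
    optimizer=tf.keras.optimizers.Adadelta(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Mini-batches of 50:
# model.fit(x_train, y_train, batch_size=50, validation_data=(x_dev, y_dev))
```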
#### Variants
* CNN-rand
* Randomly initialized word vectors.
* CNN-static
* Uses pre-trained vectors from word2vec and does not update the word vectors.
* CNN-non-static
* Same as CNN-static but updates word vectors during training.
* CNN-multichannel
* Uses two sets of word vectors (channels).
* One set is fine-tuned during training while the other is kept static (see the sketch below).
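In code, the variants differ mainly in how the embedding layer is initialized and whether it is trainable. A rough sketch, where `w2v_matrix` stands in for a real (vocab_size, embed_dim) matrix of pre-trained word2vec vectors:

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim, max_len = 20000, 300, 60
w2v_matrix = np.random.rand(vocab_size, embed_dim)  # placeholder for real word2vec vectors

# CNN-rand: randomly initialized, trainable embeddings.
rand_embed = tf.keras.layers.Embedding(vocab_size, embed_dim)

# CNN-static: word2vec initialization, frozen during training.
static_embed = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(w2v_matrix),
    trainable=False,
)

# CNN-non-static: word2vec initialization, fine-tuned during training.
non_static_embed = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(w2v_matrix),
    trainable=True,
)

# CNN-multichannel: both embeddings applied to the same tokens; only the
# non-static channel receives gradient updates. (The paper treats these as two
# channels seen by each filter; concatenating along the embedding axis is one
# common approximation.)
tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
multichannel = tf.keras.layers.Concatenate(axis=-1)(
    [static_embed(tokens), non_static_embed(tokens)]
)
```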
#### Datasets
* Sentiment analysis datasets for Movie Reviews, Customer Reviews, etc.
* Question classification data.
* Maximum number of classes for any dataset - 6
#### Strengths
* Good results on benchmarks despite being a simple architecture.
* Word vectors fine-tuned through the non-static channel learn more meaningful, task-specific representations.
#### Weaknesses
* Datasets are small, with few labels.
* Results are not very detailed or exhaustive.
## Summary
This paper reports on a series of experiments with CNNs trained
on top of pre-trained word vectors for sentence-level classification
tasks. The model achieves very good performance across datasets, and
state-of-the-art results on a few. The proposed model has an input layer
of concatenated word2vec embeddings, followed by a single
convolutional layer with multiple filters, max-over-time pooling,
and a fully-connected softmax layer. The authors also experiment with
static and non-static channels, i.e. whether the word2vec embeddings
are kept fixed or fine-tuned during training.
## Strengths
- Very simple yet powerful model formulation, which achieves strong
performance across datasets.
- The different model formulations drive home the point that initializing
input vectors with word2vec embeddings is better than random initialization.
Fine-tuning these embeddings for the task leads to further improvements over
static embeddings.
## Weaknesses / Notes
- No intuition as to why the model with both static and non-static channels
gives mixed results.
- They briefly mention that they experimented with SENNA embeddings, which led
to worse results, although no quantitative results are provided. It would have been
interesting to have a comparative study with GloVe embeddings as well.