Deep Unordered Composition Rivals Syntactic Methods for Text Classification on ShortScience.org

Deep Unordered Composition Rivals Syntactic Methods for Text Classification
Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

Summary by Tim Miller 6 years ago

The authors explore the properties of "Deep Averaging Networks" (DANs) on text classification problems, specifically sentiment analysis and question answering. DANs extend neural bag-of-words models: the document representation starts as the average of the word embeddings in that document and is then passed through multiple feed-forward layers. The authors argue that these models are much simpler and faster to train than syntax- and composition-based RNNs while obtaining similar performance. Since the paper is arguing for simpler models, there is little technical machinery to understand, so its real contribution is the set of experiments exploring how DANs represent various phenomena.

First, they show that differences between graded sentiment words (awesome, cool, ok, underwhelming, the worst) are magnified as layers are added, which demonstrates the benefit of depth relative to a plain neural bag of words. Then they compare against RNNs on examples containing negation and contrastive conjunctions (e.g., "but"), which are traditionally modeled syntactically. The results suggest that existing methods we assume can represent syntax and composition are in fact not strong enough. A phrase like "not bad" fully exposes the DAN: with no notion of word order or scope, it doubles the negation instead of reversing it. But while the RNN-based models can learn not to simply double the negation, they are not powerful enough to reverse the polarity and get the example correct.
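The architecture described above (average the word embeddings, then apply several feed-forward layers) can be sketched in a few lines. This is only an illustrative forward pass in plain Python, not the authors' implementation: the toy vocabulary, the layer sizes, the ReLU nonlinearity, the random untrained weights, and all function names are assumptions made for the example.

```python
import math
import random

def average_embeddings(tokens, embeddings):
    """Neural bag-of-words step: average the embedding vectors of the tokens."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dense(x, weights, bias):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    # Assumed nonlinearity; chosen here only for simplicity.
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def dan_forward(tokens, embeddings, hidden_layers, output_layer):
    """Average the embeddings, apply the hidden layers, return class probabilities."""
    h = average_embeddings(tokens, embeddings)
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    weights, bias = output_layer
    return softmax(dense(h, weights, bias))

def random_layer(rng, n_out, n_in):
    """Hypothetical helper: an untrained layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
vocab = ["not", "bad", "awesome", "movie"]  # toy vocabulary (assumption)
embeddings = {w: [rng.uniform(-1.0, 1.0) for _ in range(4)] for w in vocab}
hidden_layers = [random_layer(rng, 5, 4), random_layer(rng, 5, 5)]
output_layer = random_layer(rng, 2, 5)  # two classes, e.g. negative/positive

probs = dan_forward(["not", "bad", "movie"], embeddings, hidden_layers, output_layer)
shuffled = dan_forward(["movie", "bad", "not"], embeddings, hidden_layers, output_layer)
```

Because the input is an average, `probs` and `shuffled` agree up to floating-point rounding: the model cannot distinguish word orders, which is the "unordered" property in the paper's title and the reason negation scope is hard for it.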

Finally, the authors introduce one novel mechanism for improving training, "word dropout." Similar to standard dropout, it randomly samples a subset of words at the input layer and leaves them out of the document representation. This gives the network multiple looks at each example, each with part of its feature space removed. Another way to think of it is as data augmentation: new training instances are created from existing data points by removing some of their features.
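As a rough illustration of the idea, word dropout can be written as a filter applied to the token list before the embeddings are averaged. The sketch below is a minimal version under stated assumptions: the function name, the drop probability of 0.3, and the fallback that keeps one token when every word would be dropped are choices made for this example, not details taken from the paper.

```python
import random

def word_dropout(tokens, drop_prob, rng):
    """Keep each token independently with probability 1 - drop_prob.

    Assumption for this sketch: if every token is dropped, keep one token
    chosen at random so the average over embeddings stays defined.
    """
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(tokens)]
    return kept

rng = random.Random(0)
sentence = ["the", "movie", "was", "not", "bad", "at", "all"]

# Each training epoch sees a different subset of the same example,
# which is the "multiple looks" / data-augmentation view of the technique.
views = [word_dropout(sentence, 0.3, rng) for _ in range(3)]
for view in views:
    print(view)
```

Dropout of this kind would be applied only during training; at test time the full token list is averaged.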
