Bag of Tricks for Efficient Text Classification
Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov
arXiv e-Print archive, 2016
Keywords: cs.CL
First published: 2016/07/06
Abstract: This paper explores a simple and efficient baseline for text classification.
Our experiments show that our fast text classifier fastText is often on par
with deep learning classifiers in terms of accuracy, and many orders of
magnitude faster for training and evaluation. We can train fastText on more
than one billion words in less than ten minutes using a standard multicore CPU,
and classify half a million sentences among 312K classes in less than a minute.
#### Introduction
* Introduces fastText, a simple and highly efficient approach for text classification.
* On par with deep learning classifiers in terms of accuracy, yet many orders of magnitude faster for training and evaluation.
* [Link to the paper](http://arxiv.org/abs/1607.01759v3)
* [Link to code](https://github.com/facebookresearch/fastText)
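
A minimal, hedged usage sketch with the fastText Python bindings; the file names, hyper-parameter values, and example sentence below are placeholders, not values from the paper.

```python
# Hedged usage sketch with the fastText Python bindings; file names and
# hyper-parameter values are placeholders, not taken from the paper.
import fasttext

# train.txt: one example per line, with labels prefixed by "__label__"
model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.1, wordNgrams=2)

# Predict the most probable label for a new sentence.
labels, probs = model.predict("this movie was surprisingly good")
print(labels, probs)

# Number of examples, precision@1, and recall@1 on a held-out file.
print(model.test("test.txt"))
```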
#### Architecture
* Built on top of linear models with a rank constraint and a fast loss approximation.
* Word representations are averaged into a text representation, which is then fed to a linear classifier.
* The text representation acts as a hidden state that can be shared among features and classes.
* Softmax layer to obtain a probability distribution over pre-defined classes.
* The linear classifier has computational complexity $O(kh)$, where $k$ is the number of classes and $h$ is the dimension of the text representation (see the sketch after this list).
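
An illustrative NumPy sketch of the model body (not the official implementation): word embeddings are averaged into a text representation and passed through a linear classifier with a softmax. The matrix names `A` and `B` follow the paper's notation; the dimensions are made up for the example.

```python
# Illustrative sketch: average embeddings -> linear classifier -> softmax.
# Dimensions and inputs below are invented for demonstration only.
import numpy as np

vocab_size, hidden_dim, num_classes = 10_000, 10, 5
rng = np.random.default_rng(0)

A = rng.normal(size=(vocab_size, hidden_dim))   # word/n-gram embedding look-up table
B = rng.normal(size=(hidden_dim, num_classes))  # linear classifier weights

def predict_proba(word_ids):
    hidden = A[word_ids].mean(axis=0)           # averaged text representation, shape (h,)
    logits = hidden @ B                         # full softmax costs O(k * h)
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()

print(predict_proba([4, 87, 1032]))             # probability distribution over classes
```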
##### Hierarchical Softmax
* Based on a Huffman coding tree.
* Reduces the complexity to $O(h\log_2 k)$.
* The top $T$ targets (leaves of the tree) can be retrieved efficiently in $O(\log T)$ using a binary heap (see the sketch after this list).
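
An illustrative sketch of hierarchical-softmax scoring (not fastText's actual code): each class is a leaf of a binary tree, and its probability is the product of sigmoid decisions along the root-to-leaf path, so scoring one class costs time proportional to its depth rather than to $k$. The tiny 3-class tree, paths, and node vectors below are made up for the example.

```python
# Illustrative hierarchical softmax: P(class) is a product of sigmoid decisions
# along the root-to-leaf path. The tiny 3-class tree here is invented.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_dim = 10
rng = np.random.default_rng(0)

# For each class: the internal nodes visited and the +1/-1 (left/right)
# decision taken at each node. Class 2 has a shorter (more frequent) code.
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1)],
}
node_vectors = rng.normal(size=(2, hidden_dim))  # one vector per internal node

def class_probability(hidden, label):
    prob = 1.0
    for node, direction in paths[label]:
        prob *= sigmoid(direction * (hidden @ node_vectors[node]))
    return prob

hidden = rng.normal(size=hidden_dim)             # averaged text representation
print(sum(class_probability(hidden, c) for c in range(3)))  # sums to 1.0
```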
##### N-gram Features
* Instead of modeling word order explicitly, uses a bag of n-grams as additional features to capture partial local word order without losing efficiency or accuracy.
* Uses the [hashing trick](https://arxiv.org/pdf/0902.2206.pdf) for a fast and memory-efficient mapping of the n-grams (a sketch follows below).
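
A rough sketch of the hashing trick for bigram features; the bucket count and hash function here are illustrative, not the ones used by fastText.

```python
# Hash each adjacent word pair into a fixed number of buckets; collisions are
# tolerated in exchange for bounded memory and no explicit bigram dictionary.
import zlib

NUM_BUCKETS = 2_000_000  # illustrative bucket count

def bigram_ids(tokens):
    return [zlib.crc32(f"{a} {b}".encode()) % NUM_BUCKETS
            for a, b in zip(tokens, tokens[1:])]

print(bigram_ids("the movie was surprisingly good".split()))
```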
#### Experiments
##### Sentiment Analysis
* fastText benefits from using bigrams.
* Outperforms [char-CNN](http://arxiv.org/abs/1502.01710v5) and [char-CRNN](http://arxiv.org/abs/1602.00367v1) and performs a bit worse than [VDCNN](http://arxiv.org/abs/1606.01781v1).
* Orders of magnitude faster in terms of training time.
* Note: fastText does not use pre-trained word embeddings.
##### Tag Prediction
* fastText with bigrams outperforms [Tagspace](http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf).
* fastText is up to 600 times faster at test time.