ShortScience.org - Making Science Accessible!

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identity mappings. In practice this does not seem to be easy for conventional networks, which get much worse as layers are added. The idea is therefore to add identity shortcuts that skip some layers, so that the skipped layers only have to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information in the forward pass is no longer a problem
    * No vanishing / exploding gradient
* Identity shortcuts have no parameters to be learned
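
To make the first point concrete, here is a minimal sketch in plain Python, not the paper's implementation. The two-layer form of the residual function and the helper names (`matvec`, `residual_block`, `plain_block`) are assumptions for illustration, and the final ReLU that the paper applies after the addition is left out to keep the identity case visible.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def plain_block(x, W1, W2):
    """Conventional block: the layers must produce the full output."""
    return matvec(W2, relu(matvec(W1, x)))


def residual_block(x, W1, W2):
    """Residual block: the layers only produce F(x); the shortcut adds x."""
    f = plain_block(x, W1, W2)
    return [xi + fi for xi, fi in zip(x, f)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]

print(residual_block(x, zeros, zeros))  # [1.5, -2.0]: zero weights give the identity
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]: zero weights lose the input
```

With all weights at zero the residual block passes its input through unchanged, while the plain block would need carefully tuned weights to do the same.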

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay is 0.0001 ($10^{-4}$) and momentum is 0.9. Mini-batches have size 256 on ImageNet and 128 on CIFAR-10.
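
The "divide by 10 when the error plateaus" rule can be sketched as follows. This is only an illustration: the plateau criterion (`patience` epochs without an improvement of at least `min_delta`) is an assumption, since the summary does not say how a plateau is detected.

```python
def plateau_schedule(errors, lr=0.1, factor=10.0, patience=2, min_delta=1e-3):
    """Return the learning rate used in each epoch.

    `errors` is the error observed at the end of each epoch. The rate is
    divided by `factor` once the error has failed to improve by at least
    `min_delta` for `patience` consecutive epochs (hypothetical criterion).
    """
    best = float("inf")
    stale = 0
    rates = []
    for err in errors:
        rates.append(lr)          # rate that was in effect during this epoch
        if err < best - min_delta:
            best = err            # real improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr /= factor      # plateau detected: drop the rate
                stale = 0
    return rates


errors = [0.9, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4]
for epoch, rate in enumerate(plateau_schedule(errors)):
    print(epoch, rate)
```

In this toy run the rate stays at 0.1 for the first five epochs and is then reduced by a factor of 10 once the error has stalled at 0.5.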

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by unmesh 9 years ago

$\bf Summary:$
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).

The paper proposes two strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels to the 3x3 filters. The filter budget is assumed to be given, i.e., they do not tinker with the number of filters. Reducing the number of parameters tends to cost accuracy; to compensate, the authors propose to downsample late in the network, so that the convolution layers work on large activation maps.
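
The parameter formula and the two strategies can be checked with a few lines of arithmetic. The channel counts below (256 in, 256 filters, squeezing to 32 channels) are made-up numbers for illustration, not taken from the paper, and biases are ignored.

```python
def conv_params(in_channels, num_filters, kernel_size):
    """Weights in one conv layer: inputs x filters x (kernel x kernel)."""
    return in_channels * num_filters * kernel_size * kernel_size


# Baseline: a 3x3 convolution with 256 input channels and 256 filters.
baseline = conv_params(256, 256, 3)

# Strategy (i): same layer, but with 1x1 filters instead of 3x3.
pointwise = conv_params(256, 256, 1)

# Strategy (ii): first reduce the input to 32 channels with a 1x1 layer,
# then apply the 3x3 filters to the reduced input.
squeezed = conv_params(256, 32, 1) + conv_params(32, 256, 3)

print(baseline, pointwise, squeezed)  # 589824 65536 81920
```

In this example both strategies keep the number of output filters at 256 while cutting the weight count by roughly 9x and 7x respectively.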

The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction in model size while preserving accuracy. The model can be compressed further with existing methods such as Deep Compression, which are orthogonal to this paper's approach; combined, this gives a reduction of around 510x in total while still preserving AlexNet's accuracy.

$\bf Question$: The impact on running time (especially in the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least to be *no worse* than for the baseline models?

Exploring the Limits of Language Modeling
Józefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by yangjunpro 9 years ago

#### This nice paper looks amazing at first sight, since it brings a mixture of:
- Fancy models
- A state-of-the-art training procedure (considering the 32-GPU distributed training effort, which takes 21 days to get the best result)
- Significant improvement on the theoretical metric (perplexity reduced from 51.3 to 30 for a single model and from 41.0 to 23.7 for an ensemble)
- A benchmark on a somewhat industry-scale data set (vocabulary of 793471 words, 0.8B words of training data) rather than a purely research one.
    
#### However, I also want to add some criticism:
- As [1] mentions, perplexity is a somewhat confusing metric: a big perplexity reduction may not reflect the real improvement and can instead have an "exaggerating" effect.
- The paper only reports language model improvements. However, LMs are usually embedded in a more complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the paper reported results from integrating its LMs into some end-to-end product. Since the authors work for the Google Brain team, this is not too stringent a requirement.
- As far as I know, the data set used in this paper comes from news stories [2], which are more formal than spoken language. In real applications we usually face less formal data (such as in search engines and speech recognition). It remains an open question how the best model in this paper will perform in a more realistic scenario. Again, for the Google Brain team it should not be a big obstacle to integrate it into an existing system by replacing or complementing the existing LMs.
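
One way to read the first criticism: perplexity is the exponential of the average per-word negative log-likelihood, so a given relative drop in perplexity corresponds to a much smaller relative drop in the underlying log-loss. The sketch below applies this to the single-model numbers quoted above; the helper names are hypothetical and natural logarithms are an arbitrary choice.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def relative_drop(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# A model that gives every token probability 1/4 has perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

ppl_before, ppl_after = 51.3, 30.0
ppl_drop = relative_drop(ppl_before, ppl_after)
loss_drop = relative_drop(math.log(ppl_before), math.log(ppl_after))

print(f"perplexity reduction: {ppl_drop:.1%}")
print(f"log-loss reduction:   {loss_drop:.1%}")
```

The perplexity falls by roughly 42%, while the per-word log-loss it is derived from falls by only roughly 14%.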
     
Although I have posted some personal criticism, I still appreciate this nice paper and recommend it as a "must-read" for people working in NLP and related fields, since I think it provides a unifying and comprehensive survey-style perspective that helps us grasp the latest state-of-the-art language modeling technology efficiently.
 
References:
- [1] http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
- [2] http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41880.pdf

Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E
Neural Information Processing Systems Conference - 2012 via Local Bibsonomy
Keywords: image, imagenet, thema:deepwalk, classification

[link] Summary by Martin Thoma 9 years ago

This paper is about Convolutional Neural Networks for Computer Vision. It was the first breakthrough in the ImageNet classification challenge (LSVRC-2010, 1000 classes).

ReLU was a key aspect, which was not used very often before. The paper also used Dropout in the first two fully connected layers.

## Training details

* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$.
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.
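
The first three items combine into a single weight update: the velocity is scaled by the momentum, then the weight-decay term $0.0005 \cdot \varepsilon \cdot w$ and the gradient step $\varepsilon \cdot g$ are subtracted. Below is a minimal plain-Python sketch of that rule on a toy quadratic loss; the function name and the toy loss are assumptions for illustration.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad; w <- w + v."""
    v_new = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy problem: minimise 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_step(w, v, grad=list(w))

print(w)  # close to 0 after 200 steps
```

Note that the decay term is multiplied by the learning rate, which is why the summary writes the weight decay as $0.0005 \cdot \varepsilon$.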

## See also

* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)

Distributed representations of words and phrases and their compositionality
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff
Advances in neural information processing systems - 2013 via Local Bibsonomy
Keywords: thema:deepwalk, language, modelling, representation

[link] Summary by NIPS Conference Reviews 9 years ago

The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al. (citation [7] in the paper), which learns linear word embeddings that are particularly useful for analogical-reasoning tasks. The proposed extensions (namely, negative sampling and sub-sampling of high-frequency words) enable extremely fast training of the model on large-scale datasets. They also result in significantly improved performance compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase-level embeddings by slightly tweaking the original training algorithm.

This paper proposes three improvements to the skip-gram model, which learns embeddings for words. The first is subsampling of frequent words, the second is a simplified version of noise contrastive estimation (NCE), and the third is a method for learning idiomatic phrase embeddings. In all three cases the improvements are somewhat ad hoc. In practice, both the subsampling and the negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.
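
As an illustration of the first improvement, the paper discards each occurrence of a word $w$ with probability $1 - \sqrt{t / f(w)}$, where $f(w)$ is the word's relative frequency and $t$ is a threshold of around $10^{-5}$. The sketch below implements that formula; the function name and the example frequencies are made up, and clamping negative values to 0 is an assumption for rare words.

```python
import math


def discard_probability(freq, threshold=1e-5):
    """Probability of dropping one occurrence of a word.

    `freq` is the word's relative frequency in the corpus (count / total).
    Words rarer than `threshold` are never dropped.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


# Hypothetical relative frequencies for a very common and a rare word.
for word, freq in [("the", 0.05), ("aardvark", 1e-6)]:
    print(word, round(discard_probability(freq), 4))
```

A word that makes up 5% of the corpus is dropped almost every time, while a word rarer than the threshold is always kept.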

The article does not propose any explanation as to why negative sampling produces better results than NCE, which it is supposed to loosely approximate. In fact, beyond the obvious generalization gain, it does not explain why the negative sampling scheme should be preferred to NCE, since they achieve similar speeds.
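
For reference, the negative sampling objective for one (center, context) pair rewards a high dot product with the true context word and a low dot product with each of $k$ sampled noise words. Unlike NCE, it needs only the samples, not the numerical probabilities of the noise distribution. The following plain-Python sketch computes the loss for a single pair; the function names and toy vectors are assumptions for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, context, negatives):
    """Negated objective: -log s(c.w) - sum over noise words of log s(-n.w)."""
    loss = -log_sigmoid(dot(context, center))
    for neg in negatives:
        loss -= log_sigmoid(-dot(neg, center))
    return loss


center = [1.0, 0.0]
good = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0], [-1.0, 1.0]])
bad = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0], [1.0, 1.0]])
print(good < bad)  # True: aligned context and opposed noise give a lower loss
```

The noise words would normally be drawn from a noise distribution over the vocabulary; here they are fixed by hand to keep the example deterministic.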