ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
scholar.google.com

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by unmesh 9 years ago

$\bf Summary:$
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).

The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of the filter is given, i,e., they do not tinker with the number of filters. Decrease in number of parameters will lead to less accuracy. To compensate, the authors propose to downsample late in the network. 

The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction is model size while preserving the accuracy. Their model can be further compressed with existing methods like Deep Compression which are orthogonal to this paper's approach and this can give in total of around 510x reduction while still preserving accuracy of AlexNet.

$\bf Question$: The impact on running times (specially on feed forward phase which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well or at least be *no worse* than the baseline models?

arxiv.org
scholar.google.com

Exploring the Limits of Language Modeling
Józefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by yangjunpro 9 years ago

#### This nice paper looks amazing at the first sight since it brings a mixture of:
- Fancy models 
- State-of-art training procedure(considering the 32-GPU distributed training effort which takes 21 days to get the best result) 
- Significant theory metric improvement(single model: 51.3 -> 30 perplexity reduction, ensemble model:41.0 -> 23.7)
- Benchmark on a somewhat industry scale(vocabulary of 793471 words,  0,8B words training data) data-set rather than a pure research one.
    
#### However, I also want to add some criticism:
- As [1] mentioned perplexity is somewhat confusing metric, big perplexity may not reflect the real improvement, it would rather bring some kind of "exaggerating" effect.
- This paper only provide the language model improvement, however, LMs are usually embedded into a complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the LMs provided in this paper could share its result with integrating into some end-to-end products. Since the authors are working for Google Brain team, this is not too much a stringent requirement. 
- So far as I know, the data set used by this paper is from news stories[2], this kind of data set is more formal than oral one. And for real application, what we face are usually less formal data(such as search engine and speech recognition). It is still a question what the best model mentioned in this paper will perform in a more realistic scenario. Again, for Google Brain team, this should not be a big obstacles for integrating it with existing system just by replacing or complementing the existing LMs.
     
Although I posted some personal criticism, I do still appreciate this nice paper and recommend this as a "must-read" for NLP and related guys since I do think this paper provide a unifying and comprehensive survey-style perspective for us to help grasp the latest state-of-art language model technology in an efficient way. 
 
References:
- [1].http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
- [2].http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41880.pdf

www.jmlr.org
scholar.google.com

Visualizing data using t-SNE
Van der Maaten, Laurens and Hinton, Geoffrey
Journal of Machine Learning Research - 2008 via Local Bibsonomy
Keywords: visualization, clustering, machine_learning, t-SNE, PCA

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* Method to visualize high-dimensional data points in 2/3 dimensional space.
* Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation.
* Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
* t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix.
* Based on SNE (Stochastic Neighbor Embedding)
* [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)

#### SNE

* Given a set of datapoints $x_1, x_2, ...x_n, p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. Calculation of $\sigma_i$ is described later.
* Similarly, define $q_{i|j}$ as conditional probability corresponding to low-dimensional representations of $y_i$ and $y_j$ (corresponding to $x_i$ and $x_j$). The variance of Gaussian in this case is set to be $1/\sqrt{2}$
* Argument is that if $y_i$ and $y_j$ captures the similarity between $x_i$ and $x_j$, $p_{i|j}$ and $q_{i|j}$ should be equal. So objective function to be minimized is Kullback-Leibler (KL) Divergence measure for $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represent conditional probability distribution given $x_i$ ($y_i$)
* Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure.
* Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find $\sigma_i$ which produces the $P_i$ having same perplexity as specified by the user.
* Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing.

#### t-SNE (t-Distributed SNE)

##### Symmetric SNE

* A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
* Symmetric because $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$
* More robust to outliers and has a simpler gradient expression.

##### Crowding Problem

* When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters.
* One way around this problem is to use UNI-SNE but optimization of the cost function, in that case, is difficult.

##### t-SNE

* Instead of Gaussian, use a heavy-tailed distribution (like Student-t distribution) to convert distances into probability scores in low dimensions. This way moderate distance in high-dimensional space can be modeled by larger distance in low-dimensional space.
* Student-t distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast. 
* The cost function is easy to optimize.

##### Optimization Tricks

###### Early Compression

* At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another. 
* Implemented an L2-penalty term proportional to the sum of the squared distance of datapoints from the origin.

###### Early Exaggeration

* Scale all the $p_{i|j}$'s so that large $q_{i|j}$*'s are obtained with the effect that natural clusters in the data form tight, widely separated clusters as a lot of empty space is created in the low-dimensional space.

##### t-SNE on large datasets

* Space and time complexity is quadratic in the number of datapoints so infeasible to apply on large datasets.
* Select a random subset of points (called landmark points) to display.
* for each landmark point, define a random walk starting at a landmark point and terminating at any other landmark point.
* $p_{i|j}$ is defined as fraction of random walks starting at $x_i$ and finishing at $x_j$ (both these points are landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy data points).

#### Advantages of t-SNE

* Gaussian kernel employed by t-SNE (in high-dimensional) defines a soft border between the local and global structure of the data.
* Both nearby and distant pair of datapoints get equal importance in modeling the low-dimensional coordinates.
* The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
* Random walk version of t-SNE takes care of "short-circuit" problem.

#### Limitations of t-SNE

* It is unclear t-SNE would perform on general **Dimensionality Reduction** for more than 3 dimensions. For such higher (than 3) dimensions, Student-t distribution with more degrees of freedom should be more appropriate.
* t-SNE reduces the dimensionality of data mainly based on local properties of the data which means it would fail in data which has intrinsically high dimensional structure (**curse of dimensionality**).
* The cost function for t-SNE is not convex requiring several optimization parameters to be chosen which would affect the constructed solution.

scholar.google.com

Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E
Neural Information Processing Systems Conference - 2012 via Local Bibsonomy
Keywords: image, imagenet, thema:deepwalk, classification

[link] Summary by Martin Thoma 9 years ago

This paper is about Convolutional Neural Networks for Computer Vision. It was the first break-through in the ImageNet classification challenge (LSVRC-2010, 1000 classes).

ReLU was a key aspect which was not so often used before. The paper also used Dropout in the last two layers.

## Training details

* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$.
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.

## See also

* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)