ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Deep Networks with Stochastic Depth
Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: deeplearning, acreuser

[link] Summary by Martin Thoma 9 years ago

**Dropout for layers** sums it up pretty well. The authors built on the idea of [deep residual networks](http://arxiv.org/abs/1512.03385) to use identity functions to skip layers. 

The main advantages:

* Training speed-ups by about 25%
* Huge networks without overfitting

## Evaluation

* [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html): 4.91% error ([SotA](https://martin-thoma.com/sota/#image-classification): 2.72 %) Training Time: ~15h
* [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html): 24.58% ([SotA](https://martin-thoma.com/sota/#image-classification): 17.18 %) Training time: < 16h
* [SVHN](http://ufldl.stanford.edu/housenumbers/):  1.75% ([SotA](https://martin-thoma.com/sota/#image-classification): 1.59 %) - trained for 50 epochs, begging with a LR of 0.1, divided by 10 after 30 epochs and 35. Training time: < 26h

arxiv.org
arxiv-vanity.com
scholar.google.com

Dual Learning for Machine Translation
Yingce Xia and Di He and Tao Qin and Liwei Wang and Nenghai Yu and Tie-Yan Liu and Wei-Ying Ma
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by tqri 8 years ago

In this article, the authors provide a framework for training two translation models with large accessible monolingual corpus.

In traditional methods, machine translation models always require large parallel corpus to train a good quality model, which is expensive to acquire. However, the massive monolingual data is not fully utilized. The monolingual corpus are typically used in pretraining the NMT decoder rnn and augmenting initial parallel corpus through self-generated translations.

The authors embed machine translation task into a reinforcement learning framework, in which two agents act as two different native speakers respectively and know little about each other and then they learn to translate by trying to communicate with each other.

**The two speakers**, `A` and `B`, obviously know well about their corresponding language respectively, this situation is easily simulated by two well-trained language models for `A` and `B`. Then, speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they don't know each other, `B` is uncertain about what `A` truly means by saying $y$. However, `B` is capable of evaluate the degree of sensibility of $y$ from his own understanding. Next, `B` informs `A` his sensibility evaluation score and tries to recover what `A` truly means in `A`'s language, i.e. $x'$. And similarly, `A` can also evaluate the degree of sensibility of $x'$ from his own understanding.

In general, the very original idea that `A` tried to convey, is passed through a noisy channel to `B`, and then back to `A` through another noisy channel. The former noisy channel is a `A-B` translation model and the latter a `B-A` translation model in the framework.

Think about how the first American learnt Chinese in history and I think it is intuitively similar to the principle in this work.

1 Comments

papers.nips.cc
scholar.google.com

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So 

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\
 -0.652 \\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\
4.33 & 4.08 & 2.09 \\\
2.52 & 2.37 & 1.21 \\\
\end{array}\right]
$$


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$.



### Still a draft

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization**

$$W \sim U \left [  - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on a artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and many mini-batch epochs were learned. It might be that batch normalization / dropout might change the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that is has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?

1 Comments

arxiv.org
arxiv-vanity.com
scholar.google.com

Wasserstein GAN
Martin Arjovsky and Soumith Chintala and Léon Bottou
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by MarvMind 8 years ago

This very new paper, is currently receiving quite a bit of attention by the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/).

The paper describes a new training approach, which solves the two major practical problems with current GAN training:

1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and will help debugging, tune parameters and so on.

2) The training process does not suffer from all the instability problems. In particular the paper reduces mode collapse significantly.

On top of that, the paper comes with quite a bit mathematical theory, explaining why there approach works and other approachs have failed. This paper is a must read for anyone interested in GANs.