ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Dual Learning for Machine Translation
Yingce Xia and Di He and Tao Qin and Liwei Wang and Nenghai Yu and Tie-Yan Liu and Wei-Ying Ma
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by tqri 8 years ago

In this article, the authors provide a framework for training two translation models with large accessible monolingual corpus.

In traditional methods, machine translation models always require large parallel corpus to train a good quality model, which is expensive to acquire. However, the massive monolingual data is not fully utilized. The monolingual corpus are typically used in pretraining the NMT decoder rnn and augmenting initial parallel corpus through self-generated translations.

The authors embed machine translation task into a reinforcement learning framework, in which two agents act as two different native speakers respectively and know little about each other and then they learn to translate by trying to communicate with each other.

**The two speakers**, `A` and `B`, obviously know well about their corresponding language respectively, this situation is easily simulated by two well-trained language models for `A` and `B`. Then, speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they don't know each other, `B` is uncertain about what `A` truly means by saying $y$. However, `B` is capable of evaluate the degree of sensibility of $y$ from his own understanding. Next, `B` informs `A` his sensibility evaluation score and tries to recover what `A` truly means in `A`'s language, i.e. $x'$. And similarly, `A` can also evaluate the degree of sensibility of $x'$ from his own understanding.

In general, the very original idea that `A` tried to convey, is passed through a noisy channel to `B`, and then back to `A` through another noisy channel. The former noisy channel is a `A-B` translation model and the latter a `B-A` translation model in the framework.

Think about how the first American learnt Chinese in history and I think it is intuitively similar to the principle in this work.

1 Comments

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization**

$$W \sim U \left [  - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on a artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and many mini-batch epochs were learned. It might be that batch normalization / dropout might change the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that is has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?

1 Comments

arxiv.org
scholar.google.com

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by unmesh 8 years ago

$\bf Summary:$
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).

The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of the filter is given, i,e., they do not tinker with the number of filters. Decrease in number of parameters will lead to less accuracy. To compensate, the authors propose to downsample late in the network. 

The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction is model size while preserving the accuracy. Their model can be further compressed with existing methods like Deep Compression which are orthogonal to this paper's approach and this can give in total of around 510x reduction while still preserving accuracy of AlexNet.

$\bf Question$: The impact on running times (specially on feed forward phase which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well or at least be *no worse* than the baseline models?

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
arxiv-vanity.com
scholar.google.com

Rao-Blackwellized Stochastic Gradients for Discrete Distributions
Runjing Liu and Jeffrey Regier and Nilesh Tripuraneni and Michael I. Jordan and Jon McAuliffe
arXiv e-Print archive - 2018 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by Gavin Gray 5 years ago

This paper approaches the problem of optimizing parameters of a discrete distribution with respect to some loss function that is an expectation over that distribution. In other words, an experiment will probably be a variational autoencoder with discrete latent variables, but there are many real applications:

$$
\mathcal{L} (\eta) : = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right]
$$

Using the [product rule of differentiation][product] the derivative of this loss function can be computed by enumerating all $1 \to K$ possible values of $z$:

$$
\nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \nabla_\eta \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \\
=  \sum_{k=1}^{K}  f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta  f_\eta (k)
$$

This expectation can also be expressed as the score function estimator (aka the REINFORCE estimator):

$$
\nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \sum_{k=1}^{K}  \left(f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta  f_\eta (k)\right)\frac{q_\eta (k)}{q_\eta (k)} \\
= \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \nabla_\eta \log q_\eta (k) + q_\eta (k) \nabla_\eta  f_\eta (k) \\
=  \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_\eta (k) \nabla_\eta \log q_\eta (k) +  \nabla_\eta  f_\eta (k) \right] \\
= \sum_{k=1}^{K}  f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta  f_\eta (k) =  \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right]
$$
In other words, both can be referred to as estimators $g(z)$.

The authors note that this can be calculated over a subset of the $k$ most probable states (overloading their $k$ from possible values of $z$). Call this set $C_k$:

$$
\nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right]  = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] \\
=  \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1}\{ z \in C_k\} + g(z) \mathbb{1} \{ z \notin C_k \} \right] \\
= \sum_{z \in C_k} q_\eta(z) g(z) +  \mathbb{E}_{z \sim q_{\eta} (z)} \left[  g(z) \mathbb{1} \{ z \notin C_k \} \right]
$$

As long as $k$ is small, it's easy to calculate the first term, and if most of the probability mass is contained in that set, then it shouldn't matter how well we approximate the second term.

The authors choose an importance-sampling for the second term, but this is where I get confused. They denote their importance weighting function $q_\eta (z \notin C_k)$ which *could* mean all of the probability mass *not* under the states in $C_k$? Later, they define a decision variable $b$ that expresses whether we are in this set or not, and it's sampled with probability $q_\eta (z \notin C_k)$, so I think my interpretation is correct. The gradient estimator then becomes:

$$
\hat{g} (v) = \sum_{z \in C_k} q_\eta (z) g(z) + q_\eta (z \notin C_k) g(v)\\
v \sim q_\eta | v \notin C_k
$$

[product]: https://en.wikipedia.org/wiki/Product_rule

Showing this is Rao-Blackwellization
----------------------------------------------

Another way to express $z$ would be to sample a Bernoulli r.v. with probability $\sum_{j \notin C_k} q_\eta (j) $, then if it's $1$ sample from $z \in C_k$ and if it's $0$ sample from $z \notin C_k$. As long as those samples are drawn using $q_\eta$ then:

$$
T(u,v,b) \stackrel{d}{=} z \\
T := u^{1-b} v^b
$$
where $u \sim q_\eta | C_k$, $v \sim q_\eta | v \notin C_k$ and $b \sim \text{Bernoulli}(\sum_{j \notin C_k} q_\eta (j))$.

Expressing $z$ in this way means the gradient estimator from before can be written as:

$$
\hat{g} (v) = \mathbb{E} \left[ g( T(u,v,b) ) | v \right]
$$
And they left it as an exercise for the reader to expand that out and show it's the same as equation 6:
$$
\mathbb{E} \left[ g( T(u,v,b) ) | v \right] = \mathbb{E} \left[ g( T(u,v,b)) \mathbb{1} \{ b=0 \} + g( T(u,v,b)) \mathbb{1} \{ b=1 \} \right] \\
=  \mathbb{E} \left[ g(z) \mathbb{1} \{ z \in C_k \} + g( z) \mathbb{1} \{ z \notin C_k \} \right] =  \mathbb{E} \left[ g(z) \right] 
$$

Writing the estimator as a conditional expectation of some statistic of the random variables under the distribution is sufficient to show that this is an instance of Rao-Blackwellization. To be safe, the authors also apply the [conditional variance decomposition][eve] to reinforce the property that RB estimators always have lower variance:

$$
Var(Y) = E\left[ Var (Y|X) \right] + Var(E \left[ Y | X \right] ) \\
Var(g (z) ) = Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) + E \left[ Var ( g( T(u,v,b) ) | v ) \right] \\
Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) = Var (\hat{g} (v) ) = Var(g (z) ) -  E \left[ Var ( g( T(u,v,b) ) | v ) \right]
$$

They go on to show that the variance is less than or equal to $Var(g(z)) \sum_{j \notin C_k} q_\eta (j)$.

Finally, they note that the variance of a simple estimator can also be reduced by taking multiple samples and averaging. They then provide an equation to calculate the optimal $k$ number of elements of $z$ to evaluate depending on how concentrated the distribution being evaluated is, and a proof showing that this will have a lower variance than the naive estimator.

$$
\hat{k} = \underset{k \in {0, ..., N}}{\operatorname{argmin}} \frac{\sum_{j \notin C_k} q_\eta (j)}{N-k}
$$

I'm not very interested in the experiments right now, but skimming through them it's interesting to see that this method performs very well on a high dimensional hard attention task on MNIST. Particularly because a Gumbel-softmax estimator falls apart in the same experiment. It would be nice to see results on RL problems as were shown in the [RELAX][] paper.

[eve]: https://en.wikipedia.org/wiki/Law_of_total_variance
[relax]: https://arxiv.org/abs/1711.00123