Welcome to ShortScience.org! |
[link]
This is a very techincal paper and I only covered items that interested me * Model * Encoder * 8 layers LSTM * bi-directional only first encoder layer * top 4 layers add input to output (residual network) * Decoder * same as encoder except all layers are just forward direction * encoder state is not passed as a start point to Decoder state * Attention * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer * computed from output of 1st decoder layer * pre-feed to all layers * Training has two steps: ML and RL * ML (cross-entropy) training: * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04] * clipping=5, batch=128 * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps) * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets) * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4 * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$ * mean $r$ computed from $m=15$ samples * SGD, 400K steps, 3 days, no drouput * Prediction (i.e. Decoder) * beam search (3 beams) * A normalized score is computed to every beam that ended (died) * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$ * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sum-log of attention weight of every input word (i.e. after summing over all output words) * Do a second pruning using normalized scores |
[link]
Li et al. propose an adversarial attack motivated by second-order optimization and uses input randomization as defense. Based on a Taylor expansion, the optimal adversarial perturbation should be aligned with the dominant eigenvector of the Hessian matrix of the loss. As the eigenvectors of the Hessian cannot be computed efficiently, the authors propose an approximation; this is mainly based on evaluating the gradient under Gaussian noise. The gradient is then normalized before taking a projected gradient step. As defense, the authors inject random noise on the input (clean example or adversarial example) and compute the average prediction over multiple iterations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
In the future, AI and people will work together; hence, we must concern ourselves with ensuring that AI will have interests aligned with our own. The authors suggest that it is in our best interests to find a solution to the "value-alignment problem". As recently pointed out by Ian Goodfellow, however, [this may not always be a good idea](https://www.quora.com/When-do-you-expect-AI-safety-to-become-a-serious-issue). Cooperative Inverse Reinforcement Learning (CIRL) is a formulation of a cooperative, partial information game between a human and a robot. Both share a reward function, but the robot does not initially know what it is. One of the key departures from classical Inverse Reinforcement Learning is that the teacher, which in this case is the human, is not assumed to act optimally. Rather, it is shown that sub-optimal actions on the part of the human can result in the robot learning a better reward function. The structure of the CIRL formulation is such that it should encourage the human to not attempt to teach by demonstration in a way that greedily maximizes immediate reward. Rather, the human learns how to "best respond" to the robot. CIRL can be formulated as a dec-POMDP, and reduced to a single-agent POMDP. The authors solved a 2D navigation task with CIRL to demonstrate the inferiority of having the human follow a "demonstration-by-expert" policy as opposed to a "best-response" policy. |
[link]
Lee et al. propose a variant of adversarial training where a generator is trained simultaneously to generated adversarial perturbations. This approach follows the idea that it is possible to “learn” how to generate adversarial perturbations (as in [1]). In this case, the authors use the gradient of the classifier with respect to the input as hint for the generator. Both generator and classifier are then trained in an adversarial setting (analogously to generative adversarial networks), see the paper for details. [1] Omid Poursaeed, Isay Katsman, Bicheng Gao, Serge Belongie. Generative Adversarial Perturbations. ArXiv, abs/1712.02328, 2017. |
[link]
This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop. The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the pre-activation (i.e. before the activation fonction) vectors during forward propagation. This is due to the fact that computing the pre-activation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can efficiently then add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for all the minibatch. The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence. Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of a SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is an hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model. Experiments on MNIST confirm that tuning that parameter works and allows to reach good performance of various network sizes, compared to using a default dropout rate. ##### My two cents This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed to make a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that that connection also yielded a method for tuning the dropout rate automatically. I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities. Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done. |