Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Sequence to Sequence Learning with Neural Networks

Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V.

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V.

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

[link]
#### Introduction * The paper proposes a general and end-to-end approach for sequence learning that uses two deep LSTMs, one to map input sequence to vector space and another to map vector to the output sequence. * For sequence learning, Deep Neural Networks (DNNs) requires the dimensionality of input and output sequences be known and fixed. This limitation is overcome by using the two LSTMs. * [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) #### Model * Recurrent Neural Networks (RNNs) generalizes feed forward neural networks to sequences. * Given a sequence of inputs $(x_{1}, x_{2}...x_{t})$, RNN computes a sequence of outputs $(y_1, y_2...y_t')$ by iterating over the following equation: $$h_t = sigm(W^{hx}x_t + W^{hh} h_{t-1})$$ $$y^{t} = W^{yh}h_{t}$$ * To map variable length sequences, the input is mapped to a fixed size vector using an RNN and this fixed size vector is mapped to output sequence using another RNN. * Given the long-term dependencies between the two sequences, LSTMs are preferred over RNNs. * LSTMs estimate the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed dimensional representation and then computing the probability of output with a standard LST-LM formulation. ##### Differences between the model and standard LSTMs * The model uses two LSTMs (one for input sequence and another for output sequence), thereby increasing the number of model parameters at negligible computing cost. * Model uses Deep LSTMs (4 layers). * The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words in the source sentence (input sentence) are much closer to first few words in the target sentence (output sentence) thereby making it easier for LSTM to "establish" communication between input and output sentences. #### Experiments * WMT'14 English to French dataset containing 12 million sentences consisting of 348 million French words and 304 million English words. * Model tested on translation task and on the task of re-scoring the n-best results of baseline approach. * Deep LSTMs trained in sentence pairs by maximizing the log probability of a correct translation $T$, given the source sentence $S$ * The training objective is to maximize this log probability, averaged over all the pairs in the training set. * Most likely translation is found by performing a simple, left-to-right beam search for translation. * A hard constraint is enforced on the norm of the gradient to avoid the exploding gradient problem. * Min batches are selected to have sentences of similar lengths to reduce training time. * Model performs better when reversed sentences are used for training. * While the model does not beat the state-of-the-art, it is the first pure neural translation system to outperform a phrase-based SMT baseline. * The model performs well on long sentences as well with only a minor degradation for the largest sentences. * The paper prepares ground for the application of sequence-to-sequence based learning models in other domains by demonstrating how a simple and relatively unoptimised neural model could outperform a mature SMT system on translation tasks. |

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

Rakelly, Kate and Zhou, Aurick and Quillen, Deirdre and Finn, Chelsea and Levine, Sergey

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

Rakelly, Kate and Zhou, Aurick and Quillen, Deirdre and Finn, Chelsea and Levine, Sergey

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

[link]
Rakelly et al. propose a method to do off-policy meta reinforcement learning (rl). The method achieves a 20-100x improvement on sample efficiency compared to on-policy meta rl like MAML+TRPO. The key difficulty for offline meta rl arises from the meta-learning assumption, that meta-training and meta-test time match. However during test time the policy has to explore and sees as such on-policy data which is in contrast to the off-policy data that should be used at meta-training. The key contribution of PEARL is an algorithm that allows for online task inference in a latent variable at train and test time, which is used to train a Soft Actor Critic, a very sample efficient off-policy algorithm, with additional dependence of the latent variable. The implementation of Rakelly et al. proposes to capture knowledge about the current task in a latent stochastic variable Z. A inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given context c of the current task in from of transition tuples $(s,a,r,s')$ and trained with an information bottleneck. Note that the task inference is done on samples according to a sampling strategy sampling more recent transitions. The latent z is used as an additional input to policy $\pi(a \vert s, z)$ and Q-function $Q(a,s,z)$ of a soft actor critic algorithm which is trained with offline data of the full replay buffer. https://i.imgur.com/wzlmlxU.png So the challenge of differing conditions at test and train times is resolved by sampling the content for the latent context variable at train time only from very recent transitions (which is almost on-policy) and at test time by construction on-policy. Sampling $z \sim q(z \vert c)$ at test time allows for posterior sampling of the latent variable, yielding efficient exploration. The experiments are performed across 6 Mujoco tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show: - PEARL is 20-100x more sample-efficient - the posterior sampling of the latent context variable enables deep exploration that is crucial for sparse reward settings - the inference network could be also a RNN, however it is crucial to train it with uncorrelated transitions instead of trajectories that have high correlated transitions - using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse reward navigation tasks which is attributed to the lack of temporally extended exploration. The paper introduces an algorithm that allows to combine meta learning with an off-policy algorithm that dramatically increases the sample-efficiency compared to on-policy meta learning approaches. This increases the chance of seeing meta rl in any sort of real world applications. |

Training Very Deep Networks

Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
TLDR; The authors propose "Highway Networks", which uses gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convnolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with state-of-the art for the MNIST and CIFAR tasks. Gates outputs vary significantly with the input examples, demonstrating that the network not just learns a "fixed structure", but dynamically routes data based for specific examples examples. Datasets used: MNIST, CIFAR-10, CIFAR-100 #### Key Takeaways - Apply LSTM-like gating to networks layers. Transform gate T and carry gate C. - The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations. - Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc) to initially force the networks to pass through information and learn long-term dependencies. - HWN does not learn a fixed structure (same gate outputs), but dynamic routing based on current input. - In complex data sets each layer makes an important contritbution, which is shown by lesioning (setting to pass-through) individual layers. #### Notes / Questions - Seems like the authors did not use dropout in their experiments. I wonder how these play together. Is dropout less effective for highway networks because the gates already learn efficients paths? - If we see that certain gates outputs have low variance across examples, can we "prune" the network into a fixed strucure to make it more efficient (for production deployments)? |

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (6 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |

Meta-Learning via Learned Loss

Sarah Bechtle and Artem Molchanov and Yevgen Chebotar and Edward Grefenstette and Ludovic Righetti and Gaurav Sukhatme and Franziska Meier

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, cs.RO, stat.ML

**First published:** 2019/06/12 (3 years ago)

**Abstract:** Typically, loss functions, regularization mechanisms and other important
aspects of training parametric models are chosen heuristically from a limited
set of options. In this paper, we take the first step towards automating this
process, with the view of producing models which train faster and more
robustly. Concretely, we present a meta-learning method for learning parametric
loss functions that can generalize across different tasks and model
architectures. We develop a pipeline for meta-training such loss functions,
targeted at maximizing the performance of the model trained under them. The
loss landscape produced by our learned losses significantly improves upon the
original task-specific losses in both supervised and reinforcement learning
tasks. Furthermore, we show that our meta-learning framework is flexible enough
to incorporate additional information at meta-train time. This information
shapes the learned loss function such that the environment does not need to
provide this information during meta-test time.
more
less

Sarah Bechtle and Artem Molchanov and Yevgen Chebotar and Edward Grefenstette and Ludovic Righetti and Gaurav Sukhatme and Franziska Meier

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, cs.RO, stat.ML

[link]
Bechtle et al. propose meta learning via learned loss ($ML^3$) and derive and empirically evaluate the framework on classification, regression, model-based and model-free reinforcement learning tasks. The problem is formalized as learning parameters $\Phi$ of a meta loss function $M_\phi$ that computes loss values $L_{learned} = M_{\Phi}(y, f_{\theta}(x))$. Following the outer-inner loop meta algorithm design the learned loss $L_{learned}$ is used to update the parameters of the learner in the inner loop via gradient descent: $\theta_{new} = \theta - \alpha \nabla_{\theta}L_{learned} $. The key contribution of the paper is the way to construct a differentiable learning signal for the loss parameters $\Phi$. The framework requires to specify a task loss $L_T$ during meta train time, which can be for example the mean squared error for regression tasks. After updating the model parameters to $\theta_{new}$ the task loss is used to measure how much learning progress has been made with loss parameters $\Phi$. The key insight is the decomposition via chain-rule of $\nabla_{\Phi} L_T(y, f_{\theta_{new}})$: $\nabla_{\Phi} L_T(y, f_{\theta_{new}}) = \nabla_f L_t \nabla_{\theta_{new}}f_{\theta_{new}} \nabla_{\Phi} \theta_{new} = \nabla_f L_t \nabla_{\theta_{new}}f_{\theta_{new}} [\theta - \alpha \nabla_{\theta} \mathbb{E}[M_{\Phi}(y, f_{\theta}(x))]]$. This allows to update the loss parameters with gradient descent as: $\Phi_{new} = \Phi - \eta \nabla_{\Phi} L_T(y, f_{\theta_{new}})$. This update rules yield the following $ML^3$ algorithm for supervised learning tasks: https://i.imgur.com/tSaTbg8.png For reinforcement learning the task loss is the expected future reward of policies induced by the policy $\pi_{\theta}$, for model-based rl with respect to the approximate dynamics model and for the model free case a system independent surrogate: $L_T(\pi_{\theta_{new}}) = -\mathbb{E}_{\pi_{\theta_{new}}} \left[ R(\tau_{\theta_{new}}) \log \pi_{\theta_{new}}(\tau_{new})\right] $. The allows further to incorporate extra information via an additional loss term $L_{extra}$ and to consider the augmented task loss $\beta L_T + \gamma L_{extra} $, with weights $\beta, \gamma$ at train time. Possible extra loss terms are used to add physics priors, encouragement of exploratory behavior or to incorporate expert demonstrations. The experiments show that this, at test time unavailable information, is retained in the shape of the loss landscape. The paper is packed with insightful experiments and shows that the learned loss function: - yields in regression and classification better accuracies at train and test tasks - generalizes well and speeds up learning in model based rl tasks - yields better generalization and faster learning in model free rl - is agnostic across a bunch of evaluated architectures (2,3,4,5 layers) - with incorporated extra knowledge yields better performance than without and is superior to alternative approaches like iLQR in a MountainCar task. The paper introduces a promising alternative, by learning the loss parameters, to MAML like approaches that learn the model parameters. It would be interesting to see if the learned loss function generalizes better than learned model parameters to a broader distribution of tasks like the meta-world tasks. |

About