#### Introduction

* The paper proposes a general, end-to-end approach to sequence learning that uses two deep LSTMs: one to map the input sequence to a fixed-dimensional vector and another to map that vector to the output sequence.
* For sequence learning, standard Deep Neural Networks (DNNs) require the dimensionality of the input and output sequences to be known and fixed. This limitation is overcome by using the two LSTMs.
* [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

#### Model

* Recurrent Neural Networks (RNNs) generalize feed-forward neural networks to sequences.
* Given a sequence of inputs $(x_1, x_2, \ldots, x_T)$, an RNN computes a sequence of outputs $(y_1, y_2, \ldots, y_T)$ by iterating over the following equations: $$h_t = \text{sigm}(W^{hx}x_t + W^{hh}h_{t-1})$$ $$y_t = W^{yh}h_t$$
* To map between variable-length sequences, the input sequence is first mapped to a fixed-size vector by one RNN, and this fixed-size vector is mapped to the output sequence by another RNN.
* Given the long-term dependencies between the two sequences, LSTMs are preferred over plain RNNs.
* The LSTM estimates the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation and then computing the probability of the output sequence with a standard LSTM-LM formulation.

##### Differences between the model and standard LSTMs

* The model uses two LSTMs (one for the input sequence and another for the output sequence), which increases the number of model parameters at negligible computational cost.
* The model uses deep LSTMs (4 layers).
* The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words of the source (input) sentence are much closer to the first few words of the target (output) sentence, making it easier for the LSTM to "establish communication" between the input and output sentences.

#### Experiments

* WMT'14 English-to-French dataset containing 12 million sentence pairs, consisting of 348 million French words and 304 million English words.
* The model is tested on direct translation and on re-scoring the n-best lists of a baseline SMT system.
* The deep LSTMs are trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$.
* The training objective is to maximize this log probability, averaged over all pairs $\mathcal{S}$ in the training set: $\frac{1}{|\mathcal{S}|}\sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)$.
* The most likely translation is found by performing a simple, left-to-right beam search.
* A hard constraint is enforced on the norm of the gradient to avoid the exploding-gradient problem.
* Mini-batches are built from sentences of similar lengths to reduce training time.
* The model performs better when the reversed source sentences are used for training.
* While the model does not beat the state of the art, it is the first purely neural translation system to outperform a phrase-based SMT baseline.
* The model also performs well on long sentences, with only a minor degradation for the longest ones.
* The paper prepares the ground for applying sequence-to-sequence models in other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on a translation task.
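To make the encoder-decoder setup concrete, here is a minimal PyTorch sketch (not the paper's implementation; the vocabulary sizes, embedding/hidden dimensions and clipping threshold are illustrative assumptions): one deep LSTM encodes the reversed source into a fixed-size state, a second LSTM generates the target conditioned on it, and training maximizes the log probability of the correct translation with gradient-norm clipping.

```python
# Minimal sketch of the two-LSTM encoder-decoder idea; all sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        src = torch.flip(src, dims=[1])                      # reverse the source word order
        _, state = self.encoder(self.src_emb(src))           # fixed-size summary (h, c)
        out, _ = self.decoder(self.tgt_emb(tgt_in), state)   # decode conditioned on it
        return self.proj(out)                                # logits over the target vocab

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (2, 7))       # toy batch: 2 source sentences of length 7
tgt_in = torch.randint(0, 10000, (2, 9))    # shifted target tokens fed to the decoder
tgt_out = torch.randint(0, 10000, (2, 9))   # gold next tokens
logits = model(src, tgt_in)

# Maximize log p(T | S), i.e. minimize token-level cross-entropy; clip the gradient norm.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```

At inference time, a left-to-right beam search over the decoder's output distribution would replace the teacher-forced targets used here.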
Schmidt et al. theoretically and experimentally show that training adversarially robust models requires a higher sample complexity than regular generalization. Theoretically, they analyze two very simple families of datasets, e.g., consisting of two Gaussian distributions corresponding to a two-class problem. On such datasets, they prove that "robust generalization", i.e., generalization to adversarial examples, requires a much higher sample complexity than regular generalization, i.e., generalization to the test set. These results are interesting because they suggest that the sample complexity might be even worse for more complex and realistic data distributions, such as those we commonly tackle in computer vision.

Experimentally, they show similar results on MNIST, CIFAR-10 and SVHN. Varying the size of the training set and plotting the accuracy on adversarially computed examples results in Figure 1. As can be seen, there seems to be a clear advantage to having larger training sets. Note that these models were trained using adversarial training with an $L_\infty$ adversary constrained by the given $\epsilon$.

https://i.imgur.com/SriBAt4.png

Figure 1: Training set size plotted against the adversarial test accuracy on MNIST, CIFAR-10 and SVHN. The models were trained using adversarial training.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
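For context, this is roughly what adversarial training with an $L_\infty$-bounded PGD adversary looks like; a hedged sketch, assuming a PyTorch classifier `model` and a `loader` yielding inputs in $[0, 1]$, with step size, number of steps and $\epsilon$ as illustrative placeholders rather than the paper's exact values.

```python
# Sketch of L_inf PGD adversarial training; hyper-parameters are illustrative.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Gradient ascent on the loss, projected into an L_inf ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                # keep a valid image
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    for x, y in loader:
        x_adv = pgd_linf(model, x, y)                    # craft adversarial examples
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()      # train on them
        optimizer.step()
```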
This paper deals with the question of what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension, Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n + d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling (finite-sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is totally random) into account: if $p_c = 1$, then $m$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, not only on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
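A hedged sketch of the paper's randomization test: a tiny MLP fitting Gaussian-noise "images" with uniformly random labels stands in for the Inception/AlexNet-on-CIFAR-10 setups of the paper; all sizes and the optimizer are illustrative assumptions.

```python
# Randomization test sketch: the network drives training error to ~0 on random labels.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 512, 3 * 32 * 32, 10
x = torch.randn(n, d)                      # random "images" (Gaussian noise)
y = torch.randint(0, classes, (n,))        # completely random labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

train_acc = (model(x).argmax(dim=1) == y).float().mean()
print(f"training accuracy on random labels: {train_acc:.2f}")  # approaches 1.0
```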
This paper presents an end-to-end version of memory networks (Weston et al., 2015), in which the model does not require the intermediate strong supervision of 'supporting facts', i.e. labels for which input sentences are the best memory accesses, making it much more realistic. The model also performs multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, on which it achieves strong results.

The paper is a useful extension of memNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined pretty cleanly and simply. The related work section is quite well written, detailing the various similarities and differences with multiple streams of related work. The discussion of the model's connection to RNNs is also useful.
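A rough sketch of what one such "hop" computes (not the authors' code; the embedding scheme, sizes and the residual state update are illustrative assumptions): the query softly attends over embedded memories and reads back a weighted sum.

```python
# One soft-attention memory hop; sizes and the final output layer are hypothetical.
import torch
import torch.nn.functional as F

def memory_hop(query, mem_in, mem_out):
    """query: (d,); mem_in/mem_out: (num_memories, d) input/output memory embeddings."""
    p = F.softmax(mem_in @ query, dim=0)   # soft attention weights over memories
    o = p @ mem_out                        # read: weighted sum of output embeddings
    return query + o                       # updated query state for the next hop

d, num_mem = 20, 5
u = torch.randn(d)                              # embedded question
m_in, m_out = torch.randn(num_mem, d), torch.randn(num_mem, d)
for _ in range(3):                              # multiple hops per output symbol
    u = memory_hop(u, m_in, m_out)
W = torch.randn(30, d)                          # hypothetical answer-vocabulary projection
answer_logits = W @ u
```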
Rakelly et al. propose a method for off-policy meta-reinforcement learning (RL). The method achieves a 20-100x improvement in sample efficiency compared to on-policy meta-RL methods like MAML+TRPO.

The key difficulty for off-policy meta-RL arises from the meta-learning assumption that meta-training and meta-test time match. However, during test time the policy has to explore and therefore sees on-policy data, which is in contrast to the off-policy data that should be used at meta-training. The key contribution of PEARL is an algorithm that allows online task inference in a latent variable at train and test time, which is used to train Soft Actor-Critic (SAC), a very sample-efficient off-policy algorithm, with an additional dependence on the latent variable.

The implementation of Rakelly et al. captures knowledge about the current task in a stochastic latent variable $Z$. An inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given the context $c$ of the current task, in the form of transition tuples $(s, a, r, s')$, and is trained with an information bottleneck. Note that task inference is done on samples drawn with a sampling strategy that favors recent transitions. The latent $z$ is used as an additional input to the policy $\pi(a \vert s, z)$ and Q-function $Q(s, a, z)$ of a soft actor-critic algorithm, which is trained with off-policy data from the full replay buffer. A sketch of this task-inference step is given after this summary.

https://i.imgur.com/wzlmlxU.png

So the challenge of differing conditions at test and train time is resolved by sampling the context for the latent variable at train time only from very recent transitions (which are almost on-policy), while at test time the context is on-policy by construction. Sampling $z \sim q(z \vert c)$ at test time amounts to posterior sampling of the latent variable, yielding efficient exploration.

The experiments are performed across 6 MuJoCo tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show:

- PEARL is 20-100x more sample-efficient.
- The posterior sampling of the latent context variable enables deep exploration that is crucial in sparse-reward settings.
- The inference network could also be an RNN; however, it is crucial to train it on uncorrelated transitions rather than on trajectories, whose transitions are highly correlated.
- Using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse-reward navigation tasks, which is attributed to the lack of temporally extended exploration.

The paper introduces an algorithm that combines meta-learning with an off-policy algorithm, dramatically increasing sample efficiency compared to on-policy meta-learning approaches. This increases the chance of seeing meta-RL in real-world applications.
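A hedged sketch of the task-inference step described above: each context transition is embedded into a Gaussian factor, the factors are combined into a single posterior $q_{\Phi}(z \vert c)$ (here via a product of Gaussians), and a sampled $z$ conditions the actor and critic. Network sizes, names and the transition encoding are illustrative assumptions.

```python
# Sketch of PEARL-style task inference with a product-of-Gaussians posterior.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # per-transition mean and log-variance
        )

    def forward(self, context):                  # context: (num_transitions, transition_dim)
        mu, log_var = self.net(context).chunk(2, dim=-1)
        var = log_var.exp()
        # Combine per-transition Gaussian factors into one posterior (product of Gaussians).
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return post_mu, post_var

encoder = ContextEncoder(transition_dim=8, latent_dim=5)
context = torch.randn(16, 8)                     # 16 recent (s, a, r, s') tuples
mu, var = encoder(context)
z = mu + var.sqrt() * torch.randn_like(var)      # posterior sampling drives exploration
# z would then be concatenated to the SAC inputs, i.e. pi(a | s, z) and Q(s, a, z).
```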