#### Introduction

* The paper proposes a general, end-to-end approach to sequence learning that uses two deep LSTMs: one to map the input sequence to a vector space and another to map that vector to the output sequence.
* For sequence learning, Deep Neural Networks (DNNs) require the dimensionality of the input and output sequences to be known and fixed. This limitation is overcome by using the two LSTMs.
* [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

#### Model

* Recurrent Neural Networks (RNNs) generalize feed-forward neural networks to sequences.
* Given a sequence of inputs $(x_1, x_2, \dots, x_T)$, an RNN computes a sequence of outputs $(y_1, y_2, \dots, y_{T'})$ by iterating over the following equations: $$h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh} h_{t-1})$$ $$y_t = W^{yh}h_t$$
* To map between variable-length sequences, the input is mapped to a fixed-size vector using an RNN, and this fixed-size vector is mapped to the output sequence using another RNN (a minimal code sketch of this setup follows at the end of this summary).
* Given the long-term dependencies between the two sequences, LSTMs are preferred over plain RNNs.
* The LSTM estimates the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation and then computing the probability of the output with a standard LSTM-LM formulation.

##### Differences between the model and standard LSTMs

* The model uses two LSTMs (one for the input sequence and another for the output sequence), thereby increasing the number of model parameters at negligible computational cost.
* The model uses deep LSTMs (4 layers).
* The words in the input sequences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". By reversing the word order, the first few words of the source (input) sentence are much closer to the first few words of the target (output) sentence, making it easier for the LSTM to "establish communication" between input and output sentences.

#### Experiments

* WMT'14 English-to-French dataset containing 12 million sentences, consisting of 348 million French words and 304 million English words.
* The model is tested on the translation task and on the task of rescoring the n-best lists of a baseline system.
* The deep LSTMs are trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$.
* The training objective is to maximize this log probability, averaged over all pairs in the training set.
* The most likely translation is found by performing a simple left-to-right beam search.
* A hard constraint is enforced on the norm of the gradient to avoid the exploding-gradient problem.
* Mini-batches are selected to contain sentences of similar lengths to reduce training time.
* The model performs better when reversed sentences are used for training.
* While the model does not beat the state of the art, it is the first pure neural translation system to outperform a phrase-based SMT baseline.
* The model also performs well on long sentences, with only a minor degradation on the longest sentences.
* The paper prepares the ground for applying sequence-to-sequence learning models to other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on translation tasks.
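To make the encoder-decoder setup above concrete, here is a minimal PyTorch sketch. It is an illustration under simplifying assumptions rather than the paper's implementation: a single-layer LSTM stands in for the 4-layer deep LSTMs, the vocabularies are tiny toy ones, and beam search is omitted; all names (`Seq2Seq`, `src`, `tgt`) are invented for the example.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: one LSTM maps the (reversed) source sequence
    to a fixed-size state, a second LSTM generates the target from that state."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Reverse the source sequence to shorten the distance between the
        # first source words and the first target words.
        src = torch.flip(src, dims=[1])
        _, state = self.encoder(self.src_emb(src))       # (h, c): the fixed-size summary vector
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                         # logits over the target vocabulary

# Training maximizes the log probability of the correct translation, i.e. minimizes
# cross-entropy over target tokens, with a hard constraint on the gradient norm.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (8, 12))     # toy batch of source sentences
tgt = torch.randint(0, 1000, (8, 10))     # toy batch of target sentences
logits = model(src, tgt[:, :-1])          # teacher forcing: shift target by one step
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```

Passing the encoder's final `(h, c)` state as the decoder's initial state is the fixed-size vector bottleneck the summary describes.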
Rakelly et al. propose a method for off-policy meta reinforcement learning (meta-RL). The method achieves a 20-100x improvement in sample efficiency compared to on-policy meta-RL methods such as MAML+TRPO. The key difficulty for off-policy meta-RL arises from the meta-learning assumption that meta-training and meta-test time match: at test time the policy has to explore and therefore sees on-policy data, in contrast to the off-policy data that should be used during meta-training. The key contribution of PEARL is an algorithm that allows online task inference through a latent variable at both train and test time; this latent variable is used to train a Soft Actor-Critic (a very sample-efficient off-policy algorithm) with an additional dependence on the latent.

The implementation of Rakelly et al. captures knowledge about the current task in a stochastic latent variable $Z$. An inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given the context $c$ of the current task, in the form of transition tuples $(s,a,r,s')$, and is trained with an information bottleneck. Note that the task inference is done on samples drawn according to a sampling strategy that favors more recent transitions (a simplified code sketch of this inference step follows at the end of this summary). The latent $z$ is used as an additional input to the policy $\pi(a \vert s, z)$ and the Q-function $Q(s,a,z)$ of a Soft Actor-Critic algorithm, which is trained with off-policy data from the full replay buffer.

![](https://i.imgur.com/wzlmlxU.png)

So the challenge of differing conditions at test and train time is resolved by sampling the context for the latent variable at train time only from very recent transitions (which are almost on-policy), while at test time the context is on-policy by construction. Sampling $z \sim q(z \vert c)$ at test time amounts to posterior sampling of the latent variable, which yields efficient exploration.

The experiments are performed across 6 MuJoCo tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show:

- PEARL is 20-100x more sample-efficient.
- The posterior sampling of the latent context variable enables deep exploration, which is crucial in sparse-reward settings.
- The inference network could also be an RNN; however, it is crucial to train it on uncorrelated transitions instead of trajectories, whose transitions are highly correlated.
- Using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse-reward navigation tasks, which is attributed to the lack of temporally extended exploration.

The paper introduces an algorithm that combines meta-learning with an off-policy algorithm, dramatically increasing sample efficiency compared to on-policy meta-learning approaches. This increases the chance of seeing meta-RL in real-world applications.
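The task-inference step can be sketched roughly as follows. This is a simplification, not the authors' code: the actual PEARL encoder combines per-transition Gaussian factors as a product and is trained through the critic loss together with the KL term, and the names (`ContextEncoder`) and toy dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a set of context transitions (s, a, r, s') to a Gaussian posterior over the
    latent task variable z. Per-transition encodings are simply averaged here; the actual
    PEARL encoder combines Gaussian factors multiplicatively."""
    def __init__(self, transition_dim, latent_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance of q(z | c)
        )

    def forward(self, context):                  # context: (num_transitions, transition_dim)
        mu, logvar = self.net(context).mean(dim=0).chunk(2)
        return torch.distributions.Normal(mu, torch.exp(0.5 * logvar))

obs_dim, act_dim, latent_dim = 8, 2, 5
encoder = ContextEncoder(transition_dim=obs_dim + act_dim + 1 + obs_dim, latent_dim=latent_dim)
policy = nn.Sequential(nn.Linear(obs_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# Posterior sampling: draw z ~ q(z | c) and condition the policy on it.
# At meta-test time the context c is gathered on-policy as the agent explores.
context = torch.randn(32, obs_dim + act_dim + 1 + obs_dim)   # recent (s, a, r, s') tuples
q_z = encoder(context)
z = q_z.rsample()                                            # reparameterized task latent
obs = torch.randn(obs_dim)
action = policy(torch.cat([obs, z]))                         # pi(a | s, z)

# The information bottleneck adds a KL(q(z|c) || N(0, I)) penalty to the training loss.
prior = torch.distributions.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
kl_penalty = torch.distributions.kl_divergence(q_z, prior).sum()
```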
Wu et al. provide a framework, behavior regularized actor critic (BRAC), which they use to empirically study the impact of different design choices in batch reinforcement learning (RL). Specific instantiations of the framework include BCQ, KL-Control and BEAR.

Pure off-policy RL describes the problem of learning a policy purely from a batch $B$ of one-step transitions collected with a behavior policy $\pi_b$; the setting allows no further interaction with the environment. This learning regime is desirable in high-stakes scenarios such as education or health care. The core principle of batch RL algorithms is to stay, in some sense, close to the behavior policy.

The paper proposes to incorporate this firstly via a regularization term in the value function, denoted as **value penalty** (a simplified code sketch of this variant follows at the end of this summary). In this case the value function of BRAC takes the following form:

$$V_D^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s_t \sim P_t^{\pi}(s)}\left[R^{\pi}(s_t) - \alpha D\big(\pi(\cdot \vert s_t) \Vert \pi_b(\cdot \vert s_t)\big)\right],$$

where $\pi_b$ is the maximum likelihood estimate of the behavior policy based upon $B$. This results in the Q-function objective

$$\min_{Q} \; \mathbb{E}_{\substack{(s,a,r,s') \sim B \\ a' \sim \pi_{\theta}(\cdot \vert s')}}\left[\Big(r + \gamma \big(\bar{Q}(s',a') - \alpha D\big(\pi_{\theta}(\cdot \vert s') \Vert \pi_b(\cdot \vert s')\big)\big) - Q(s,a)\Big)^2\right]$$

and the corresponding policy update

$$\max_{\pi_{\theta}} \; \mathbb{E}_{(s,a,r,s') \sim B}\left[\mathbb{E}_{a'' \sim \pi_{\theta}(\cdot \vert s)}\big[Q(s,a'')\big] - \alpha D\big(\pi_{\theta}(\cdot \vert s) \Vert \pi_b(\cdot \vert s)\big)\right].$$

The second approach is **policy regularization**: here the regularization weight $\alpha$ is set to zero in the value objectives (V and Q) and is non-zero only in the policy objective. It is possible to instantiate, for example, the following batch RL algorithms in this framework:

- BEAR: policy regularization with a sample-based kernel MMD as $D$ and a min-max mixture of the two ensemble elements for $\bar{Q}$
- BCQ: no regularization, but policy optimization over a restricted action space

Extensive experiments over the four MuJoCo tasks Ant, HalfCheetah, Hopper and Walker show:

1. For a BEAR-like instantiation there is a modest advantage to keeping $\alpha$ fixed.
2. Using a mixture of a two- or four-element Q-network ensemble as the target value yields better returns than using a single Q-network.
3. Taking the minimum of the ensemble Q-functions is slightly better than taking a mixture (for Ant, HalfCheetah and Walker, but not for Hopper).
4. The value penalty yields higher returns than the policy penalty.
5. No choice of $D$ (MMD, KL (primal), KL (dual) or Wasserstein (dual)) significantly outperforms the others (note that this contradicts the BEAR paper, where MMD was better than KL).
6. The value-penalty version consistently outperforms BEAR, which in turn outperforms BCQ, which improves upon a partially trained baseline.

This large-scale study of different design choices helps in developing new methods. It is, however, surprising to see that most design choices in current methods are empirically shown to be non-crucial. This points to the importance of agreeing upon common test scenarios within a community, to prevent over-fitting new algorithms to a particular setting.
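A rough sketch of the value-penalty variant, assuming Gaussian policies and an analytic KL divergence for $D$ (one of several divergences BRAC supports); target networks, the $\bar{Q}$ ensemble, and the paper's divergence estimators are deliberately left out, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

obs_dim, act_dim, alpha, gamma = 4, 2, 0.1, 0.99

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy pi(. | s)."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, s):
        return Normal(self.mean(s), self.log_std.exp())

pi, pi_b = GaussianPolicy(), GaussianPolicy()   # learned policy / behavior policy cloned from B
Q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q_value(s, a):
    return Q(torch.cat([s, a], dim=-1)).squeeze(-1)

# One transition batch (s, a, r, s') sampled from the fixed replay buffer B.
s, a = torch.randn(32, obs_dim), torch.randn(32, act_dim)
r, s_next = torch.randn(32), torch.randn(32, obs_dim)

# Critic target with the value penalty alpha * D(pi(.|s') || pi_b(.|s')).
with torch.no_grad():
    a_next = pi(s_next).sample()
    penalty = kl_divergence(pi(s_next), pi_b(s_next)).sum(-1)
    target = r + gamma * (q_value(s_next, a_next) - alpha * penalty)
critic_loss = ((q_value(s, a) - target) ** 2).mean()

# Actor objective: maximize Q minus the same divergence penalty.
a_pi = pi(s).rsample()
actor_loss = (-q_value(s, a_pi) + alpha * kl_divergence(pi(s), pi_b(s)).sum(-1)).mean()
```

The policy-regularization variant would simply drop `penalty` from the critic target and keep the divergence term only in `actor_loss`.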
TLDR; The authors propose "Highway Networks", which use gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with the state of the art on the MNIST and CIFAR tasks. Gate outputs vary significantly with the input examples, demonstrating that the network does not just learn a "fixed structure", but dynamically routes data for specific examples.

Datasets used: MNIST, CIFAR-10, CIFAR-100

#### Key Takeaways

- Apply LSTM-like gating to network layers: a transform gate T and a carry gate C (see the sketch following this summary).
- The gating forces the layer inputs and outputs to be of the same size. Additional plain layers can be used for dimensionality transformations.
- Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc.) to initially force the network to pass information through and learn long-term dependencies.
- A highway network does not learn a fixed structure (the same gate outputs everywhere), but dynamic routing based on the current input.
- On complex data sets each layer makes an important contribution, which is shown by lesioning (setting to pass-through) individual layers.

#### Notes / Questions

- It seems the authors did not use dropout in their experiments. I wonder how these play together. Is dropout less effective for highway networks because the gates already learn efficient paths?
- If we see that certain gate outputs have low variance across examples, can we "prune" the network into a fixed structure to make it more efficient (for production deployments)?
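A minimal PyTorch sketch of a single highway layer (my own illustration, not the authors' code; `HighwayLayer` and the chosen dimensions are invented), showing the carry gate tied as $C = 1 - T$ and the negative gate-bias initialization:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, where H is a plain nonlinear transform
    and T is a sigmoid transform gate (the carry gate is tied to 1 - T here)."""
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        # Negative gate bias biases the layer toward carrying the input through,
        # so very deep stacks start out close to an identity mapping.
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1.0 - t) * x

# Input and output must have the same size; plain layers handle dimension changes.
layers = nn.Sequential(*[HighwayLayer(64) for _ in range(50)])
y = layers(torch.randn(8, 64))
```

With a bias around -2 the gates start mostly closed, so early in training each layer behaves nearly as an identity map; this is the property the paper credits for making very deep stacks trainable with plain SGD.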
This paper deals with the question of what and how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points to train on. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is no better than randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling (finite-sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of a classifier family $H$ should take the fraction of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is a totally random one) into account: if $p_c = 1$, then $m$ should be 0, too, as it is impossible to learn something meaningful from totally random labels (a toy sketch of this randomization test follows at the end of this summary).
* We seem to have built models which work well on image data in general, not only on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
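The label-randomization experiment is easy to reproduce in outline. Below is a toy sketch (my own, with a small MLP on synthetic data standing in for the Inception/AlexNet-on-ImageNet setup of the paper); `corrupt_labels` and all sizes are invented for illustration.

```python
import torch
import torch.nn as nn

def corrupt_labels(labels, num_classes, p_c, seed=0):
    """Replace a fraction p_c of the labels with uniformly random classes
    (p_c = 0: original labels, p_c = 1: completely random labels)."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(len(labels), generator=g) < p_c
    random_labels = torch.randint(0, num_classes, (len(labels),), generator=g)
    return torch.where(mask, random_labels, labels)

# Toy randomization test: a small MLP fit to completely random labels.
# The paper's point is that training error still reaches (close to) zero while
# test error stays at chance level, i.e. the model memorizes rather than generalizes.
x = torch.randn(512, 32)                      # fixed random "images"
y = corrupt_labels(torch.randint(0, 10, (512,)), num_classes=10, p_c=1.0)

model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

train_err = (model(x).argmax(dim=1) != y).float().mean().item()
print(f"training error on random labels: {train_err:.3f}")   # expected to approach 0
```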