Welcome to ShortScience.org! |
[link]
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas: **1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better. **2) Separate BN for the "hiddens to hiddens pre-activation and for the "inputs to hiddens" pre-activation.** In other words, 2 separate BN operators are applied on each contributions to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities. **3) Use of largest time-step BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how do deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics. Among these ideas, I believe the most impactful idea is 1). The papers mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks. Experiments on 4 datasets confirms the method's success. **My two cents** This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice. |
[link]
This paper presents a feed-forward neural network architecture for processing graphs as inputs, inspired from previous work on Graph Neural Networks. In brief, the architecture of the GG-NN corresponds to $T$ steps of GRU-like (gated recurrent units) updates, where T is a hyper-parameter. At each step, a vector representation is computed for all nodes in the graph, where a node's representation at step t is computed from the representation of nodes at step $t-1$. Specifically, the representation of a node will be updated based on the representation of its neighbors in the graph. Incoming and outgoing edges in the graph are treated differently by the neural network, by using different parameter matrices for each. Moreover, if edges have labels, separate parameters can be learned for the different types of edges (meaning that edge labels determine the configuration of parameter sharing in the model). Finally, GG-NNs can incorporate node-level attributes, by using them in the initialization (time step 0) of the nodes' representations. GG-NNs can be used to perform a variety of tasks on graphs. The per-node representations can be used to make per-node predictions by feeding them to a neural network (shared across nodes). A graph-level predictor can also be obtained using a soft attention architecture, where per-node outputs are used as scores into a softmax in order to pool the representations across the graph, and feed this graph-level representation to a neural network. The attention mechanism can be conditioned on a "question" (e.g. on a task to predict the shortest path in a graph, the question would be the identity of the beginning and end nodes of the path to find), which is fed to the node scorer of the soft attention mechanism. Moreover, the authors describe how to chain GG-NNs to go beyond predicting individual labels and predict sequences. Experiments on several datasets are presented. These include tasks where a single output is required (on a few bAbI tasks) as well as tasks where a sequential output is required, such as outputting the shortest path or the Eulerian circuit of a graph. Moreover, experiments on a much more complex and interesting program verification task are presented. |
[link]
Reinforcement Learning is often broadly separated into two categories of approaches: model-free and model-based. In the former category, networks simply take observations and input and produce predicted best-actions (or predicted values of available actions) as output. In order to perform well, the model obviously needs to gain an understanding of how its actions influence the world, but it doesn't explicitly make predictions about what the state of the world will be after an action is taken. In model-based approaches, the agent explicitly builds a dynamics model, or a model in which it takes in (past state, action) and predicts next state. In theory, learning such a model can lead to both interpretability (because you can "see" what the model thinks the world is like) and robustness to different reward functions (because you're learning about the world in a way not explicitly tied up with the reward). This paper proposes an interesting melding of these two paradigms, where an agent learns a model of the world as part of an end-to-end policy learning. This works through something the authors call "observational dropout": the internal model predicts the next state of the world given the prior one and the action, and then with some probability, the state of the world that both the policy and the next iteration of the dynamics model sees is replaced with the model's prediction. This incentivizes the network to learn an effective dynamics model, because the farther the predictions of the model are from the true state of the world, the worse the performance of the learned policy will be on the iterations where the only observation it can see is the predicted one. So, this architecture is model-free in the sense that the gradient used to train the system is based on applying policy gradients to the reward, but model-based in the sense that it does have an internal world representation. https://i.imgur.com/H0TNfTh.png The authors find that, at a simple task, Swing Up Cartpole, very low probabilities of seeing the true world (and thus very high probabilities of the policy only seeing the dynamics model output) lead to world models good enough that a policy trained only on trajectories sampled from that model can perform relatively well. This suggests that at higher probabilities of the true world, there was less value in the dynamics model being accurate, and consequently less training signal for it. (Of course, policies that often could only see the predicted world performed worse during their original training iteration compared to policies that could see the real world more frequently). On a more complex task of CarRacing, the authors looked at how well a policy trained using the representations of the world model as input could perform, to examine whether it was learning useful things about the world. https://i.imgur.com/v9etll0.png They found an interesting trade-off, where at high probabilities (like before) the dynamics model had little incentive to be good, but at low probabilities it didn't have enough contact with the real dynamics of the world to learn a sensible policy. |
[link]
Papernot et al. Introduce a novel attack on deep networks based on so-called adversarial saliency maps that are computed independently of a loss. Specifically, they consider – for a given network $F(X)$ – the forward derivative $\nabla F = \frac{\partial F}{\partial X} = \left[\frac{\partial F_j(X)}{\partial x_i}\right]_{i,j}$. Essentially, this is the regular derivative of $F$ with respect to its input; Papernot et al. seem to refer to is as “forward” derivative as it stands in contrast with regular backpropagation where the derivative of the loss with respect to the parameters is considered. They define an adversarial saliency map by considering $S(X, t)_i = \begin{cases}0 & \text{ if } \frac{\partial F_t(X)}{\partial X_i} < 0 \text{ or } \sum_{j\neq t} \frac{\partial F_j(X)}{\partial X_i} > 0\\ \left(\frac{\partial F_t(X)}{\partial X_i}\right) \left| \sum_{j \neq t} \frac{\partial F_j(X)}{\partial X_i}\right| & \text{ otherwise}\end{cases}$ where $t$ is the target class of the attack. The intuition of this definition is the following: The partial derivative of $F_t$ with respect to $X$ at location $i$ indicates how $X_i$ can be changed in order to increase $F_t$ (which is the goal). At the same time, $F_j$ for all $t \neq j$ is supposed to decrease for the targeted attack, this is implemented using the second (absolute) term. If, at a specific feature $X_i$, not increase of $X_i$ will lead to an increase of $F_t$, or an increase will also lead to an increase in the other $F_j$, the saliency map is zero – indicating that feature $i$ is useless. Note that here, only increases in $X_i$ are considered; Papernot et al. have a analogous formulation for considering decreases of $X_i$. Based on the concept of adversarial saliency maps, a simple attack is implemented as illustrated in Algorithm 1. In particular, the feature $X_i$ for which the saliency map $S(X, t)$ is maximized is chosen and increased by a fixed amount until the network $F$ changes the label to $t$ or a maximum perturbation is reached (in which case the attack fails). https://i.imgur.com/PvJv9yS.png Algorithm 1: The proposed algorithm for generating adversarial examples, see text for details. In experiments on MNIST they show the effectiveness of the proposed attack. Additionally, they attempt to quantify the robustness (called “hardness”) of specific classes. In particular, they show that some classes are harder to attack than others. To this end they derive the so-called adversarial distance $A(X, t) = 1 - \frac{1}{M}\sum_i 1_{[S(X, t)_i > 0]}$ which counts the number of features in the adversarial saliency map that are greater than zero (i.e. can be perturbed during the attack in Algorithm 1). Personally, I find this “hardness” measure quite interesting because it is independent of a specific loss, but directly takes statistics of the learned model into account. Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
## General Framework Extends T-REX (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1904.06387&a=muntermulehitch)) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for better-than-demonstrator performance. ## Motivations * Preferences over demonstrations may be difficult to obtain in practice. * There is no theoretical understanding of the requirements that lead to outperforming demonstrator. ## Contributions * Theoretical results (with linear reward function) on when better-than-demonstrator performance is possible: 1- the demonstrator must be suboptimal (room for improvement, obviously), 2- the learned reward must be close enough to the reward that the demonstrator is suboptimally optimizing for (be able to accurately capture the intent of the demonstrator), 3- the learned policy (optimal wrt the learned reward) must be close enough to the optimal policy (wrt to the ground truth reward). Obviously if we have 2- and a good enough RL algorithm we should have 3-, so it might be interesting to see if one can derive a requirement from only 1- and 2- (and possibly a good enough RL algo). * Theoretical results (with linear reward function) showing that pairwise preferences over demonstrations reduce the error and ambiguity of the reward learning. They show that without rankings two policies might have equal performance under a learned reward (that makes expert's demonstrations optimal) but very different performance under the true reward (that makes the expert optimal everywhere). Indeed, the expert's demonstration may reveal very little information about the reward of (suboptimal or not) unseen regions which may hurt very much the generalizations (even with RL as it would try to generalize to new states under a totally wrong reward). They also show that pairwise preferences over trajectories effectively give half-space constraints on the feasible reward function domain and thus may decrease exponentially the reward function ambiguity. * Propose a practical way to generate as many ranked demos as desired. ## Additional Assumption Very mild, assumes that a Behavioral Cloning (BC) policy trained on the provided demonstrations is better than a uniform random policy. ## Disturbance-based Reward Extrapolation (D-REX) ![](https://i.imgur.com/9g6tOrF.png) ![](https://i.imgur.com/zSRlDcr.png) They also show that the more noise added to the BC policy the lower the performance of the generated trajs. ## Results Pretty much like T-REX. |