[link]
Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078).

A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:

    h1 = f1(x)
    h2 = f2(h1)
    h3 = f3(h2)
    h4 = f4(h3)
    y = f5(h4)

They now build them like this:

    h1 = f1(x) + x
    h2 = f2(h1) + h1
    h3 = f3(h2) + h2
    h4 = f4(h3) + h3
    y = f5(h4) + h4

Where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.

We just took this idea to its logical extreme: what if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net. Now we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: we can loosen the numerical tolerance of the solver to make our nets faster at test time, and we can handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
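To make the residual-vs-ODE contrast concrete, here is a minimal PyTorch sketch. It is not the authors' code (which uses adaptive solvers and the adjoint method for memory-efficient gradients); this version integrates the dynamics with fixed-step Euler, and the layer sizes are chosen arbitrarily.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # One discrete residual update: h_{t+1} = h_t + f(h_t)
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, h):
        return h + self.f(h)

class ODEBlock(nn.Module):
    # Continuous analogue: dh/dt = f(h, t), integrated here with fixed-step Euler.
    # A real ODE net would hand `f` to an adaptive solver (e.g. torchdiffeq's odeint).
    def __init__(self, dim, steps=10):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim + 1, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.steps = steps

    def forward(self, h):
        t, dt = 0.0, 1.0 / self.steps
        for _ in range(self.steps):
            t_col = torch.full_like(h[:, :1], t)          # depth/time fed in as an extra input
            h = h + dt * self.f(torch.cat([h, t_col], dim=1))
            t += dt
        return h

x = torch.randn(4, 8)
print(ResidualBlock(8)(x).shape)  # torch.Size([4, 8])
print(ODEBlock(8)(x).shape)       # torch.Size([4, 8])
```

Shrinking `dt` (or letting an adaptive solver choose it) is what turns the stack of discrete residual steps into a continuously evolving system.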
[link]
**Summary**: This paper presents three tricks that make model-based reinforcement learning more reliable when tested on tasks that require walking and balancing. The tricks are 1) planning based on latent features, 2) using a recurrent network that mixes probabilistic and deterministic information, and 3) looking forward multiple steps.

**Longer summary**

Imagine playing pool, armed with a tablet that can predict exactly where the ball will bounce, and the next bounce, and so on. That would be a huge advantage to someone learning pool; however, small inaccuracies in the model could mislead you, especially when thinking ahead to the second and third bounce. The tablet is analogous to the dynamics model in model-based reinforcement learning (RL). Model-based RL promises to solve a lot of the open problems with RL, letting the agent learn from less experience, transfer well, dream, and many other advantages. Despite the promise, dynamics models are hard to get working: they often suffer from even small inaccuracies, and need to be redesigned for specific tasks.

Enter PlaNet, a clever name, and a net that plans well in a range of environments. To increase the challenge, the model must predict directly from pixels in fairly difficult tasks such as teaching a cheetah to run or balancing a ball in a cup. How do they do this? Three main tricks.

- Planning in latent space: the policy network doesn't need to look at the raw image, but looks at a summary of it as represented by a feature vector.
- Recurrent state space models: they found that probabilistic information helps describe the space of possibilities, but makes it harder for their RNN-based model to look back multiple steps. Mixing probabilistic and deterministic information gives it the best of both worlds, and their results show a clear performance increase when using both compared to just one (a rough sketch of such a mixed transition step appears at the end of this summary).
- Latent overshooting: they train the model to look more than one step ahead, which helps prevent errors that build up over time.

Overall this paper shows great results that tackle the shortfalls of model-based RL. I hope the results hold up when tested on different and more complex environments.
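As a rough illustration of the second trick, here is a minimal sketch of one transition step that mixes a deterministic GRU path with a stochastic latent. The `RSSMCell` name, layer sizes, and dimensions are placeholders, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class RSSMCell(nn.Module):
    # One step of a recurrent state-space model: a deterministic GRU path carries
    # information reliably across many steps, while a stochastic latent sampled from
    # a learned Gaussian captures the space of possible futures.
    def __init__(self, stoch=30, deter=200, action_dim=4):
        super().__init__()
        self.gru = nn.GRUCell(stoch + action_dim, deter)
        self.prior = nn.Linear(deter, 2 * stoch)   # predicts mean and log-std of the next latent

    def forward(self, stoch_state, deter_state, action):
        deter_state = self.gru(torch.cat([stoch_state, action], dim=-1), deter_state)
        mean, log_std = self.prior(deter_state).chunk(2, dim=-1)
        stoch_state = mean + log_std.exp() * torch.randn_like(mean)  # reparameterized sample
        return stoch_state, deter_state

cell = RSSMCell()
s, d, a = torch.zeros(1, 30), torch.zeros(1, 200), torch.zeros(1, 4)
s, d = cell(s, d, a)   # roll the model forward one imagined step
```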
[link]
Catastrophic forgetting is the tendency of a neural network to forget previously learned information when learning new information. This paper combats that by keeping a buffer of experience and applying meta-learning to it. They call their new module Meta Experience Replay, or MER.

How does this work? At each update they compute multiple possible updates to the model weights: one for the new batch of information and several more for batches of previous experience. Then they apply meta-learning using the Reptile algorithm: the meta-update sees each possible update and has to combine them with the least interference. This is done by favouring an update direction that maximizes the dot product between the new and old update vectors, so that it transfers as much learning as possible from the new update without interfering with the old updates.

https://i.imgur.com/TG4mZOn.png

Does it work? Yes. While it may take longer to train, the results show that it generalizes better and needs a much smaller buffer of experience than the popular approach of using replay buffers.
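A minimal sketch of the flavour of this update, assuming a Reptile-style outer step over a handful of replayed batches. The function name, `loss_fn`, and the batch format are hypothetical, and the real MER interleaves within-batch and across-batch meta-steps; this only shows why the outer interpolation implicitly favours updates whose gradients agree.

```python
import copy
import random
import torch

def mer_style_update(model, loss_fn, new_batch, replay_buffer,
                     inner_lr=0.01, meta_lr=0.1, k=3):
    # Inner loop: take SGD steps on the new batch plus k batches from the buffer.
    start = copy.deepcopy(model.state_dict())
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    for batch in [new_batch] + random.sample(replay_buffer, min(k, len(replay_buffer))):
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()
    # Outer (Reptile) step: move the original weights a fraction of the way toward the
    # weights found after the inner loop. To second order this rewards positive dot
    # products between the per-batch gradients, i.e. updates that do not interfere.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(start[name] + meta_lr * (p - start[name]))
```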
[link]
This paper proposes three new reinforcement learning tasks that involve dealing with images.

- Task 1: An agent crawls across a hidden image, revealing portions of it at each step. It must classify the image in the minimum number of steps, for example classifying the image as a cat after choosing to travel across the ears (a toy sketch of this task appears below).
- Task 2: The agent crawls across a visible image to sit on its target, for example a cat in a scene of pets.
- Task 3: The agent plays an Atari game where the background has been replaced with a distracting video.

These tasks are easy to construct, but solving them requires large-scale visual processing or attention, which typically require deep networks. To address these new tasks, popular RL agents (PPO, A2C, and ACKTR) were augmented with a deep image-processing network (ResNet-18), but they still performed poorly.
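Task 1 can be pictured as a tiny environment like the following sketch (entirely illustrative, not the paper's code): the observation is a small window of the hidden image, and the agent either moves the window or commits to a class label.

```python
import numpy as np

class CrawlAndClassifyEnv:
    # Toy version of Task 1: the agent only ever sees a window x window crop of the
    # image centred at its current position, pays a small cost per move, and gets
    # reward 1 only if the label it finally commits to is correct.
    def __init__(self, image, label, window=8):
        self.image, self.label, self.window = image, label, window
        self.pos = np.array(image.shape[:2]) // 2

    def observe(self):
        r, c, w = *self.pos, self.window
        return self.image[r:r + w, c:c + w]

    def step(self, action):
        if isinstance(action, tuple) and action[0] == "classify":
            reward = 1.0 if action[1] == self.label else 0.0
            return self.observe(), reward, True                     # episode ends on a guess
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        limit = np.array(self.image.shape[:2]) - self.window
        self.pos = np.clip(self.pos + moves[action], 0, limit)
        return self.observe(), -0.01, False                          # step penalty encourages few steps

env = CrawlAndClassifyEnv(np.random.rand(64, 64), label="cat")
obs, reward, done = env.step("up")
obs, reward, done = env.step(("classify", "cat"))
```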
[link]
How can humans help an agent perform a task that has no clear reward? Imitation, demonstration, and preferences. This paper asks which combinations of imitation, demonstration, and preferences will best guide an agent in Atari games.

For example, consider an agent that is playing Pong on the Atari but can't access the score. You might help it by demonstrating your play style for a few hours. To help the agent further, you are shown two short clips of it playing and asked to indicate which one, if any, you prefer. To avoid spending many hours rating videos, the authors sometimes used an automated approach where the game's score decides which clip is preferred, but they also compared this approach to human preferences. It turns out that human preferences are often worse because of reward traps. These happen, for example, when the human tries to encourage the agent to explore ladders, resulting in the agent obsessing over ladders instead of continuing the game.

They also observed that the agent often misunderstood the preferences it was given, causing unexpected behavior known as reward hacking. The only solution they mention is to have someone keep an eye on the agent and continue giving it preferences, but this isn't always feasible. This is the alignment problem, which is a hard problem in AGI research.

Results: adding merely a few thousand preferences helps in most games, unless they have sparse rewards. Demonstrations, on the other hand, tend to help in games with sparse rewards, but only if the demonstrator is good at the game.
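The preference signal is usually turned into a learned reward model with a pairwise (Bradley-Terry style) loss over the two clips, in the spirit of the deep-RL-from-human-preferences line of work. A minimal sketch, assuming 64-dimensional per-step features rather than raw frames:

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps a per-step feature vector to a scalar reward.
reward_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

def preference_loss(clip_a, clip_b, prefer_a):
    # clip_a, clip_b: tensors of shape (timesteps, 64) for two short clips.
    # The predicted return of a clip is the sum of its per-step rewards; the
    # probability of preferring A over B follows a logistic (Bradley-Terry) model.
    return_a = reward_model(clip_a).sum()
    return_b = reward_model(clip_b).sum()
    logit = return_a - return_b
    target = torch.tensor(float(prefer_a))
    return nn.functional.binary_cross_entropy_with_logits(logit, target)

loss = preference_loss(torch.randn(25, 64), torch.randn(25, 64), prefer_a=True)
loss.backward()   # the RL agent is then trained against the learned reward model
```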
[link]
This builds on the previous ["MERLIN"](https://arxiv.org/abs/1803.10760) paper. First they introduce the RMA agent, a simplified version of MERLIN that uses model-based RL and long-term memory. They give the agent long-term memory by letting it choose to save and load its working memory (represented by the LSTM's hidden state). Then they add credit assignment, similar to the RUDDER paper, to get the "Temporal Value Transport" (TVT) agent, which can plan long term in the face of distractions.

**The critical insight here is that they use the agent's memory access to decide on credit assignment.** So if the model uses a memory from 512 steps ago, the action from 512 steps ago gets lots of credit for the current reward.

They evaluate on various tasks, for example a maze that combines a distracting task with a memory retrieval task: after starting in a maze with, say, a yellow wall, the agent needs to collect apples, which serves as a distraction and tests whether the agent can recall memories even after being distracted. At the end of the maze it needs to remember that initial color (e.g. yellow) in order to choose the exit of the correct color. They include performance graphs showing that memory, and even more so memory plus credit assignment, is a significant help in this and similar tasks.
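A toy sketch of that insight (not DeepMind's implementation): whenever a sufficiently strong memory read at time `t_now` points back to the step where that memory was written, send part of the value estimated at `t_now` back to that earlier step as synthetic reward.

```python
def transport_value(rewards, values, read_events, alpha=0.9, strength_threshold=0.5):
    # rewards: list of per-step environment rewards for one episode.
    # values: list of the agent's value estimates at each step.
    # read_events: list of (t_now, t_past, strength) tuples, where the agent at
    #   step t_now read the memory written at step t_past with the given attention strength.
    augmented = list(rewards)
    for t_now, t_past, strength in read_events:
        if strength > strength_threshold:
            # The action taken at t_past gets credit for the value realised at t_now.
            augmented[t_past] += alpha * values[t_now]
    return augmented

# e.g. a read at step 600 of a memory written at step 88:
new_rewards = transport_value(rewards=[0.0] * 1000, values=[0.5] * 1000,
                              read_events=[(600, 88, 0.9)])
```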
[link]
[Summary by author /u/SirJAM_armedi](https://www.reddit.com/r/MachineLearning/comments/8sq0jy/rudder_reinforcement_learning_algorithm_that_is/e11swv8/).

Math aside, the "big idea" of RUDDER is the following: We use an LSTM to predict the return of an episode. To do this, the LSTM will have to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we get the reward only once the bullet hits the enemy after travelling along the screen"). We then use a salience method (e.g. LRP or integrated gradients) to get that information out of the LSTM, and redistribute the reward accordingly (i.e., we then give reward already once the gun is shot in the right direction).

Once the reward is redistributed this way, solving/learning the actual reinforcement learning problem is much, much easier and, as we prove in the paper, the optimal policy does not change with this redistribution.
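A minimal sketch of the redistribution idea, simplified to use differences of the LSTM's return predictions between consecutive steps rather than LRP or integrated gradients as in the paper; `ReturnPredictor` and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    # An LSTM predicts the episode return from the observation sequence. The reward
    # assigned to step t is the change in that prediction from t-1 to t, so steps
    # that make the predicted return jump (e.g. firing in the right direction)
    # receive reward immediately rather than when the delayed reward finally arrives.
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def redistribute(self, episode):                 # episode: (1, T, obs_dim)
        h, _ = self.lstm(episode)
        g = self.head(h).squeeze(-1)                 # predicted return after each step, (1, T)
        return torch.cat([g[:, :1], g[:, 1:] - g[:, :-1]], dim=1)  # per-step redistributed reward

model = ReturnPredictor(obs_dim=16)
per_step_reward = model.redistribute(torch.randn(1, 50, 16))   # shape (1, 50)
```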
[link]
**TL;DR:** There are "place cells" in the hippocampus that fire when the animal passes through a particular location. You can take a rat, measure how its cells activate in a maze, then monitor those neurons during planning, rest, or sleep. You'll see patterns showing that the animal is thinking of locations in order and focusing on interesting locations. This paper looks at how RL agents do prioritized experience replay and compares it to place-cell activity in animals. The authors run an RL simulation and *qualitatively* compare the results to the activity observed in place cells.

**Key paragraphs:**

> Neural activity recorded from hippocampal place cells during spatial navigation typically represents the animal’s spatial position, though it can sometimes represent locations ahead of the animal. For instance, during “sharp wave ripple” events, activity might progress sequentially from the animal’s current location towards a goal location. These “forward replay” sequences predict subsequent behavior and have been suggested to support a planning mechanism that links actions to their deferred consequences along a spatial trajectory. However, analogously to the human evidence, remote activity in the hippocampus can also represent locations behind the animal, and even altogether disjoint, remote locations (especially during rest or sleep) (Fig. 1a).

...

> we develop a normative theory to predict not just whether but which memories should be accessed at each time to enable the most rewarding future decisions.

...

> To test the implications of our theory, we simulate a spatial navigation task where an agent generates and stores experiences which can be later retrieved. We show that an agent that accesses memories sequentially and in order of utility produces patterns of sequential state consideration that resemble place cell replay, and reproduces qualitatively and with no parameter fitting a wealth of empirical findings including (i) the existence and balance between forward and reverse replay; (ii) the content of replay; and (iii) effects of experience.

...

> we propose the unifying view that all patterns of replay during behavior, rest, and sleep reflect different instances of a more general state retrieval operation that integrates experiences across space and time to propagate value and guide decisions.

**My 2 cents**: I like this paper because prioritized experience replay reminds me of how we often dream or daydream of novel good or bad events that happened or that we anticipate. This paper drills much deeper into this connection.
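A toy sketch of "access memories in order of utility": the paper scores each potential backup by gain times need, whereas here the gain term is approximated by the TD-error magnitude and `need` is assumed to be a caller-supplied state-occupancy estimate.

```python
import numpy as np

def prioritized_backups(Q, transitions, need, gamma=0.95, lr=1.0, n_backups=20):
    # Q: (n_states, n_actions) value table.
    # transitions: stored experiences as (s, a, r, s_next) tuples.
    # need: per-state estimate of how likely each state is to be revisited.
    for _ in range(n_backups):
        def utility(t):
            s, a, r, s_next = t
            gain = abs(r + gamma * Q[s_next].max() - Q[s, a])  # simplified gain: TD-error magnitude
            return gain * need[s]
        s, a, r, s_next = max(transitions, key=utility)        # replay the most useful memory first
        Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

Q = np.zeros((5, 2))
need = np.ones(5) / 5
transitions = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.0, 3)]
Q = prioritized_backups(Q, transitions, need)
```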