## General Framework

The take-home message is that the challenge of Reinforcement Learning in environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory feature extractor V to deal with the high-dimensional observations (pixels for example), but also learning a temporal representation M of the environment's dynamics to deal with the partial observability. Given such representations, learning a controller that maximizes a reward is really easy (a single linear layer evolved with CMA-ES). The authors call these representations a *world model* since the learned environment dynamics can be used to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment, provided that measures are taken to prevent the policy from exploiting the world model's inaccuracies.

## Method

**Learning the World Model**

![](https://i.imgur.com/tgV17k4.png =600x)

In this work they propose to learn these representations off-line in an unsupervised manner in order to be more efficient. They use a VAE for V, trained exclusively with the reconstruction loss; that way the learned representations are independent of the reward and can be reused with any reward. They then train M as a Mixture-Density-Network RNN (MDN-RNN) to predict the next sensory features (as extracted by the VAE), and possibly the done condition and the reward, and thus learn the dynamics of the environment in the VAE's latent space (where they are likely simpler than in pixel space). Note that the VAE's latent space is a single Gaussian (adding stochasticity makes it more robust to the "next state" outputs of M), whereas M outputs next states as a mixture of Gaussians. Indeed, an image has essentially one visual encoding, yet it can have several different future scenarios, which are captured by M's multimodal output.

**Training the policy**

![](https://i.imgur.com/H5vpb2H.png)

* In the real env: the agent is provided with the visual features and M's hidden state (temporal features).
* In the world model: to prevent the agent from exploiting this imperfect simulator, they increase the dynamics' stochasticity by tuning $\tau$, the sampling temperature of $z_{t+1}$ in M (see the sketch below).
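Below is a minimal sketch (NumPy, with hypothetical shapes and names) of how $z_{t+1}$ could be sampled from M's mixture output with a temperature $\tau$. How exactly $\tau$ enters the mixture weights and the component variances is an assumption here and may differ from the authors' implementation.

```python
import numpy as np

def sample_next_latent(logit_pi, mu, log_sigma, tau=1.0, rng=np.random):
    """Sample z_{t+1} from the MDN-RNN output, independently per latent dimension.

    logit_pi, mu, log_sigma: arrays of shape (latent_dim, n_mixtures) produced
    by M at the current step (hypothetical interface).
    tau: sampling temperature; tau > 1 makes the dream more stochastic, which
    discourages the controller from exploiting M's inaccuracies.
    """
    # Temperature flattens the mixture weights (more component switching)...
    scaled = logit_pi / tau
    probs = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    latent_dim, n_mix = mu.shape
    z = np.empty(latent_dim)
    for i in range(latent_dim):
        k = rng.choice(n_mix, p=probs[i])               # pick a mixture component
        sigma = np.exp(log_sigma[i, k]) * np.sqrt(tau)  # ...and widens each Gaussian
        z[i] = mu[i, k] + sigma * rng.standard_normal()
    return z
```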
## Limitations

If exploration is important in the environment, the initial random policy might fail to collect data in all the relevant parts of the environment, and an iterative version of Algorithm 1 might be required for data collection (see https://worldmodels.github.io/ for a discussion of the different iterative methods).

Since V is trained independently of M, it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently so that M's reward and $z_{t+1}$ prediction losses (or a next-state reconstruction loss) flow through V (which would also be trained with its own reconstruction loss). The trade-off is that V would then be tuned to a particular reward and could not be reused.

The authors argue that since $h_t$ can predict $z_{t+1}$, it contains enough insight about the future for the agent to act reflexively on $h_t$ without needing to *plan ahead*. This is interesting, but the considered tasks (driving, dodging fireballs) are still very reflexive and do not require much planning.

## Results

When trained on the true env, a simple controller on top of the V and M representations achieves SOTA on car-racing. V + M is better than V alone.

When trained inside the world model, the dynamics' stochasticity must be tuned for the policy to transfer and perform well on the real env: with too little stochasticity the agent overfits to the world model's flaws and does not transfer to the real env; with too much, the agent becomes risk-averse and robust but suboptimal.

![](https://i.imgur.com/e8ETSjQ.png)

## Additional resources

Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/
## General Framework

Extends T-REX (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1904.06387&a=muntermulehitch)) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for better-than-demonstrator performance.

## Motivations

* Preferences over demonstrations may be difficult to obtain in practice.
* There is no theoretical understanding of the requirements that lead to outperforming the demonstrator.

## Contributions

* Theoretical results (with a linear reward function) on when better-than-demonstrator performance is possible: 1- the demonstrator must be suboptimal (room for improvement, obviously), 2- the learned reward must be close enough to the reward the demonstrator is suboptimally optimizing for (i.e. it must accurately capture the demonstrator's intent), 3- the learned policy (optimal wrt the learned reward) must be close enough to the optimal policy (wrt the ground-truth reward). Obviously, if we have 2- and a good enough RL algorithm we should have 3-, so it might be interesting to see whether one can derive a requirement from only 1- and 2- (and possibly a good enough RL algo).
* Theoretical results (with a linear reward function) showing that pairwise preferences over demonstrations reduce the error and ambiguity of reward learning. They show that without rankings, two policies might have equal performance under a learned reward (one that makes the expert's demonstrations optimal) but very different performance under the true reward (one that makes the expert optimal everywhere). Indeed, the expert's demonstrations may reveal very little about the reward in unseen regions (suboptimal or not), which can badly hurt generalization (even with RL, since it would generalize to new states under a totally wrong reward). They also show that pairwise preferences over trajectories effectively give half-space constraints on the feasible reward function domain and can thus reduce the reward function ambiguity exponentially.
* Proposes a practical way to generate as many ranked demos as desired.

## Additional Assumption

Very mild: a Behavioral Cloning (BC) policy trained on the provided demonstrations is better than a uniformly random policy.

## Disturbance-based Reward Extrapolation (D-REX)

![](https://i.imgur.com/9g6tOrF.png)

![](https://i.imgur.com/zSRlDcr.png)

They also show that the more noise is injected into the BC policy, the lower the performance of the generated trajectories, which is what justifies ranking them by noise level (see the sketch below).

## Results

Pretty much like T-REX.
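As a rough illustration of the noise-injection scheme above, here is a minimal sketch assuming a gym-style `env`, a `bc_policy` callable and illustrative noise levels (the paper's exact schedule may differ):

```python
import numpy as np

def noisy_rollout(env, bc_policy, epsilon, horizon=1000, rng=np.random):
    """Roll out the BC policy with epsilon-greedy random disturbances."""
    obs, traj = env.reset(), []
    for _ in range(horizon):
        if rng.random() < epsilon:
            action = env.action_space.sample()   # random disturbance
        else:
            action = bc_policy(obs)              # cloned (suboptimal) expert action
        obs, _, done, _ = env.step(action)
        traj.append(obs)
        if done:
            break
    return traj

def generate_ranked_demos(env, bc_policy,
                          noise_levels=(1.0, 0.75, 0.5, 0.25, 0.02), per_level=5):
    """More noise is assumed to yield worse behavior, so the returned groups of
    trajectories go from lowest-ranked (most noise) to highest-ranked (least noise)."""
    return [[noisy_rollout(env, bc_policy, eps) for _ in range(per_level)]
            for eps in sorted(noise_levels, reverse=True)]
```

These automatically ranked groups can then be fed to the T-REX pairwise ranking loss as if they had been ranked by a human.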
## General Framework

We only have access to a finite set of **ranked demonstrations**. The demonstrations only contain **observations** and **do not need to be optimal**, but they must be (approximately) ranked from worst to best. The **reward learning part is off-line**, but not the policy learning part (which requires interactions with the environment).

In a nutshell: T-REX learns a reward model that looks at observations. The reward model is trained to predict whether one demonstration is ranked higher than another. Once the reward model is learned, one simply uses RL to learn a policy, which ends up outperforming the demonstrations.

## Motivations

Current IRL methods cannot outperform the demonstrations because they seek a reward function that makes the demonstrator optimal, and thus do not infer the underlying intentions of the demonstrator, which may have been poorly executed. In practice, high-quality demonstrations may be difficult to provide, and it is often easier to provide demonstrations together with a ranking of their relative performance (desirability).

## Trajectory-ranked Reward EXtrapolation (T-REX)

![](https://i.imgur.com/cuL8ZFJ.png =400x)

Uses ranked demonstrations to extrapolate a user's underlying intent beyond the best demonstrations by learning a reward that assigns greater return to higher-ranked trajectories. While standard IRL seeks a reward that **justifies** the demonstrations, T-REX learns a reward that **explains** the ranking over demonstrations (a minimal sketch of the corresponding ranking loss is given at the end of this summary).

![](https://i.imgur.com/4IQ13TC.png =500x)

Having rankings over demonstrations may remove the reward ambiguity problem (an always-zero reward cannot explain the ranking) and also provides some "data augmentation", since from a few ranked demonstrations you can define many pairwise comparisons. Additionally, suboptimal demonstrations may provide more diverse data by exploring a larger area of the state space (but may miss the parts relevant to solving the task...).

## Tricks and Tips

The authors used the ground-truth reward to rank trajectories, but they also show that approximate rankings do not hurt performance much. To avoid overfitting they used an ensemble of 5 neural networks to predict the reward. For episodic tasks, they compare subtrajectories taken at similar timesteps (the subtrajectory from the better trajectory starts a bit later in the episode than the one it is compared against, so that the reward increases as the episode progresses). At RL training time, the learned reward goes through a sigmoid to avoid large changes in the reward scale across time steps.

## Results

![](https://i.imgur.com/7ysYZKd.png)

![](https://i.imgur.com/CHO9aVT.png)

![](https://i.imgur.com/OzVD9sf.png =600x)

Results are quite positive, and performance can be good even when the learned reward is not strongly correlated with the ground truth (cf. HalfCheetah). They also show that T-REX is robust to different kinds of ranking noise: random swapping of pairwise rankings, and rankings by humans who only have access to a description of the task and not the ground-truth reward. **They also automatically rank the demonstrations using the number of learning steps of a learning expert: T-REX could therefore be used as an intrinsic reward alongside the ground truth to accelerate training.**

![](https://i.imgur.com/IfOeLY6.png =500x)

**Limitations**

They do not show that T-REX can match an optimal expert; maybe ranking demonstrations hurts when all the demos are close to optimal?
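A minimal sketch of the pairwise ranking loss described above (PyTorch; the `reward_net` interface, the shapes, and the absence of subtrajectory sampling are simplifying assumptions, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def trex_ranking_loss(reward_net, traj_worse, traj_better):
    """Bradley-Terry style loss over a pair of ranked trajectories.

    reward_net: any torch.nn.Module mapping an observation to a scalar reward.
    traj_worse, traj_better: observation tensors of shape (T, obs_dim),
    with traj_better the higher-ranked trajectory.
    """
    returns = torch.stack([reward_net(traj_worse).sum(),
                           reward_net(traj_better).sum()])
    # Cross-entropy over the two predicted returns: maximizes the probability
    # that the higher-ranked trajectory receives the larger predicted return.
    return F.cross_entropy(returns.unsqueeze(0), torch.tensor([1]))
```

In practice this loss would be averaged over many sampled pairs and over the ensemble of reward networks.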
## General Framework

Very **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful about imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so they improve on DAgger's bound of $\epsilon_{class}uT$, where $u$ is the difference in cost-to-go (between the expert and one error followed by the expert policy), down to $\epsilon_{class}T$, where $\epsilon_{class}$ is the error due to the limited expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go, they remove the cost-to-go contribution from the bound (the difference in performance between the learned policy and the expert policy) and thus obtain a tighter bound.

In the paper they use the word "regret" for two distinct concepts, which I find confusing: one refers to the no-regret online learning meta-approach to IL (similar to DAgger), the other to the fact that AggreVaTe aims at minimizing the cost-to-go difference with the expert (the sum of the extra cost I incurred because I did not behave like the expert = *regret*), whereas DAgger aims at minimizing the error rate wrt. the expert.

Additionally, the paper extends the view of imitation learning as an online learning procedure to reinforcement learning.

## Assumptions

**Interactive**: you can re-query the expert, and thus reach $\epsilon T$ bounds instead of the $\epsilon T^2$ of non-interactive methods (Behavioral Cloning), which suffer from compounding errors. Additionally, one also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss wrt the expert, for example) since the cost-to-go is computed under the expert policy (it would always yield 0 cost).

## Other methods

**SEARN**: also reasons about the **cost-to-go, but under the current policy** instead of the expert's (although in practice one can use the expert's, which makes it very similar to AggreVaTe). SEARN uses **stochastic policies** and can be seen as an AggreVaTe variant where stochastic mixing is used to solve the online learning problem instead of **Regularized Follow-The-Leader (R-FTL)**.

## AggreVaTe: IL with cost-to-go

![](https://i.imgur.com/I1otJwV.png)

Pretty much like DAgger, but one has to use a no-regret online learning algo to do **cost-sensitive** classification instead of regular classification. In the paper, they use the R-FTL algorithm and train the policy on all previous iterations. Indeed, using R-FTL with a strongly convex loss (like the squared error) and a stable batch learner (like stochastic gradient descent) ensures the no-regret property.

In practice (to deal with infinite policy classes and with knowing the cost of only a few actions per state), they reduce cost-sensitive classification to an **argmax regression problem**: they train a model to match the cost given the state and action (and the time step if we want non-stationary policies), using the collected datapoints and the (strongly convex) squared error loss. Then, they argmin this model to find the action that minimizes the cost-to-go (cost-sensitive classification). This is close to what we do in **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD error (instead of the full-rollout cost-to-go under the expert), then argmax the critic to get the policy (a sketch of this reduction is given below).
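A rough sketch of this argmax-regression reduction (scikit-learn, discrete actions, and a hypothetical feature encoding; a sketch under these assumptions, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def one_hot(a, n_actions):
    v = np.zeros(n_actions)
    v[a] = 1.0
    return v

def fit_cost_to_go(dataset, n_actions):
    """dataset: (state, action, t, cost_to_go) tuples gathered by taking one
    exploratory action and letting the expert finish the episode (full-rollout cost)."""
    X = np.array([np.concatenate([s, one_hot(a, n_actions), [t]])
                  for s, a, t, _ in dataset])
    y = np.array([c for _, _, _, c in dataset])
    # SGD with the default (strongly convex) squared loss: a stable batch learner,
    # as required by the R-FTL no-regret argument.
    return SGDRegressor().fit(X, y)

def act(model, s, t, n_actions):
    """Cost-sensitive classification: argmin of the regressed cost-to-go over actions."""
    feats = np.array([np.concatenate([s, one_hot(a, n_actions), [t]])
                      for a in range(n_actions)])
    return int(np.argmin(model.predict(feats)))
```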
Similarly to DQN, the way you explore the actions for which you compute the cost-to-go matters (in this paper they use uniform exploration).

**Limitations**

If the policy class is not expressive enough and cannot match the expert policy's performance, this algo may fail to learn a reasonable policy. Example: the task is to go from point A to point B; there exists a narrow shortcut and a safer but longer road. The expert can handle both roads, so it prefers the shortcut. Even if the learned policy class can handle the safer road, the learner will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, thus ignoring the fact that they cannot be achieved by any policy in the learned policy class.

## RL via No-Regret Policy Iteration (NRPI)

![](https://i.imgur.com/X4ckv1u.png)

NRPI no longer requires an expert policy, only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's, since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the equivalent argmax regression problem is very similar to Q-learning (using sampled costs-to-go from rollouts instead of Bellman errors), except that **the costs-to-go** in the aggregated dataset correspond to **outdated policies!** (In contrast, DQN's replay data consists of rewards rather than costs-to-go.) Yet, since R-FTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations, and both the policies and the costs-to-go converge.

The performance of NRPI is strongly limited by the quality of the exploration distribution. Yet if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for non-interactive, state-only IL (if you have access to a reward).

## General limitations

Both methods are much less sample-efficient than DAgger, as they require costs-to-go: one full rollout for a single data point.

## Broad contribution

Seeing iterative learning methods such as Q-learning in the light of online learning methods is insightful and yields better bounds and a better understanding of why some methods might work. It is a good tool to analyze the dynamics that interleave learning and execution (optimizing and collecting data) for the purpose of generalization. For example, the bound for NRPI can seem quite counter-intuitive to someone familiar with the on-policy/off-policy distinction: NRPI optimizes a policy wrt the **costs-to-go of other policies**, yet R-FTL tells us that it converges towards what we want. Additionally, it may give a practical advantage for stability, as the policy is optimized with larger batches and thus has to be good across many states and many cost-to-go formulations.
## General Framework

*Wording: car = the autonomous car, driver = the other car it is interacting with.*

Builds a model of an **autonomous car's influence over the behavior of an interacting driver** (human or simulated) that the autonomous car can leverage to plan more efficiently. The driver is modeled as the policy that maximizes a defined objective. In brief, a **linear reward function is learned off-line with IRL on human demonstrations**, and the modeled policy takes the actions that maximize the driver's return under this reward function (computed with **Model Predictive Control (MPC)**). They show that a car planning with this driver model learns ways to **influence the driver, either towards specified behaviors** or in ways that **achieve a higher payoff**. These results also hold when interacting with human **drivers, who are only loosely approximated by this driver model**.

I believe the **key to this success lies in a learned reward that generalizes well** to new states (a linear function of a few, clever hand-designed features) as well as in the **use of MPC**: by **focusing on modeling goals, it captures a reasonable driver policy** (since the learned reward is accurate) and does not have to deal with concurrent learning dynamics. On the other hand, IL would try to match behavior instead of goals and might not generalize as well to new (unseen) situations. Additionally, MPC makes it possible to coordinate the car-driver interactions over **extended time horizons** (through planning). **This shows that leveraging an influence model is promising for communication emergence.**

*Parallel with: promoting influence like Jaques et al. is more effective than hoping agents figure it out by themselves like MADDPG.*

## Motivations

Previous methods use simplistic influence models ("the driver will keep a constant velocity"), yet the car's behavior influences the driver whether the car is aware of it or not. Such a simplistic model only leads to "defensive" behaviors where the car avoids disturbing the driver, therefore does not interact with it, and ends up with suboptimal strategies. Additionally, a simplistic interaction model seems to lead exclusively to functional actions instead of communicative ones.

## Assumptions

* **Sequential decision making**: the car acts first, which forces a driver response, which makes the world transition to a new state (this could be alleviated by considering influence across time steps: the car's action at time t influences the driver's action at time t+1).
* Approximates the **human as an optimal planner** (but with limited recursive induction), without uncertainty (deterministic model), etc.
* The car uses **1st-order recursive induction** ("my action influences the driver, but the driver believes that its action won't influence me, i.e. that my plan is fixed no matter what it does"): *I see this as assuming a "defensive" driver and an "aggressive" car.*
* **Fully observable** environment (not sure how difficult it would be to relax this).
* Model Predictive Control (MPC) with L-BFGS requires access to a **differentiable transition function**.

## Method

They model the interaction as a dynamical system where the car's action influences both the next state and the driver's action. They use Theano to backprop through the transition function, and the implicit function theorem to derive a symbolic expression of the gradient.
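A schematic sketch of the resulting nested optimization (not the paper's implementation: the paper backpropagates through the dynamics with Theano and uses the implicit function theorem to get the gradient of the driver's response, rather than numerically re-solving the inner problem as done here; `dynamics`, `r_car`, `r_driver` and the 2-D action are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import minimize

ACTION_DIM = 2  # e.g. (steering, acceleration); an assumption for this sketch

def plan_return(u_car, u_drv, x0, reward, dynamics):
    """Unroll the deterministic dynamics and accumulate a reward over the horizon."""
    x, total = x0, 0.0
    for a_car, a_drv in zip(u_car, u_drv):
        x = dynamics(x, a_car, a_drv)
        total += reward(x, a_car, a_drv)
    return total

def driver_response(u_car, x0, horizon, r_driver, dynamics):
    """Inner problem: the driver best-responds, assuming the car's plan u_car is
    fixed (the 1st-order recursive induction assumption)."""
    def neg(u):
        return -plan_return(u_car, u.reshape(horizon, ACTION_DIM), x0, r_driver, dynamics)
    res = minimize(neg, np.zeros(horizon * ACTION_DIM), method="L-BFGS-B")
    return res.x.reshape(horizon, ACTION_DIM)

def car_plan(x0, horizon, r_car, r_driver, dynamics):
    """Outer problem: the car optimizes its own reward *through* the driver's response."""
    def neg(u):
        u_car = u.reshape(horizon, ACTION_DIM)
        u_drv = driver_response(u_car, x0, horizon, r_driver, dynamics)
        return -plan_return(u_car, u_drv, x0, r_car, dynamics)
    res = minimize(neg, np.zeros(horizon * ACTION_DIM), method="L-BFGS-B")
    return res.x.reshape(horizon, ACTION_DIM)
```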
## Results

* The car leverages the driver model to **influence** simulated drivers (the car has a **perfect model**) in **specified ways** (by modifying the car's reward function).
* The car leverages the driver model to **influence** simulated drivers (the car has a **perfect model**) in order to be **more efficient wrt its own objective**.
* The car leverages the driver model to **influence** human drivers (the car has an **imperfect model**) in **specified ways**.

*Colors: the autonomous car is yellow, the driver's car is red.*

![](https://i.imgur.com/RK1Gx2P.png)

![](https://i.imgur.com/F77hSfp.png)

![](https://i.imgur.com/3elOG9O.png)

![](https://i.imgur.com/yeSsjiP.png)
## General Framework

The imitation learning problem is here cast as a classification problem: label each state with the corresponding expert action. With this view, structured prediction (predicting the next label knowing your previous predictions) can be seen as a degenerate IL problem. They make the **reduction assumption** that you can make the probability of a mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any action followed by the expert policy is small (which does not hold if one mistake makes you fall off a cliff, for example!).

## Problem definition

Three problems are highlighted in this paper: **1- interaction between policy and distribution**, **2- violation of the i.i.d. assumption**, and **3- distribution shift and the resulting compounding error of BC**. The paper implies that **1** implies **2**, which implies **3**, but the proposed algorithm (DAgger) eventually only addresses **3**.

1. **Interaction between policy and distribution**: the fact that the learned policy appears in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex**, even for convex loss functions. **This also implies non-i.i.d. samples!** Indeed, even if one could sample directly from the state-action distribution (e.g. given its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, a dependency would still appear across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample, so my next sample depends on my previous one (since it conditioned my policy update).
2. **Violation of the i.i.d. assumption**: we just discussed that 1. implies non-i.i.d. samples across updates (batches), yet there is another form of dependency (inside a batch this time): in practice we collect samples using rollouts, and within a trajectory $s'$ depends on $(s, a)$, so samples are dependent because they were collected **sequentially**.
3. **Distribution shift and compounding error of Behavioral Cloning (supervised learning)**: during training, you assume i.i.d. data and do not model the fact that the next state you will have to classify (act upon) is influenced by your previous action. But at execution time, samples are not i.i.d.: if you make a mistake you reach a state that you may have never seen (distribution shift) and you cannot recover (compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (cf. Levine's class example with the tightrope walker and the worst-case scenario: as soon as you make a mistake, you cannot recover and you keep making mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses: they obtain a worst-case bound linear in both** $\epsilon$ and $T$ (the two bounds are written out below).

**Additional assumption compared to BC:** during training you are allowed to query the expert as much as you want.
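Writing $J$ for the expected total cost over a horizon $T$ (a notational assumption, following the usual statement of these results), the two worst-case guarantees discussed in this summary read, up to lower-order terms:

$$J(\hat{\pi}) \leq J(\pi^\ast) + \epsilon T^2 \quad \text{(Behavioral Cloning)}, \qquad J(\hat{\pi}) \leq J(\pi^\ast) + u\,\epsilon T \quad \text{(DAgger-style no-regret approach)},$$

where $u$ bounds the increase in the expert's cost-to-go after a single mistake.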
## Other methods

* **Forward Training**: trains a **non-stationary policy** (one policy per time step). Each policy is trained to match the expert on the distribution of states encountered at that time step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, and you cannot stop until you have trained all of them.
* **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, thereby iteratively accounting for the distribution drift; eventually the distribution should converge. This yields near-linear regret in $T$ and $\epsilon$. **Problem**: at execution time we sample a policy from the mixture, and it can perform poorly on the given state, especially if that state corresponds to a different distribution (an early iteration, for example).

## DAgger -- Dataset Aggregation

**Algo**

![](https://i.imgur.com/FpQAyI2.png =400x)

$\beta_i$ controls the proportion of the collected state distribution that we want to be similar to the expert's distribution. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in irrelevant states. In practice, using $\beta_i=I(i=0)$ works well (a minimal sketch of the resulting loop is given after the requirements below). Intuitively, by training on the transitions it generates itself, DAgger reduces the distribution shift.

**Link to no-regret online learning**

DAgger can be seen as a *Follow-The-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an "extra trick" that does not exist as such in FTL algos. *Regret*: the difference between what I did and the best thing I could have done. Each iteration of DAgger can be seen as a step of online learning (treat the mini-batch of trajectories collected under a single policy as a single online-learning example), where the online algo used is FTL but could actually be any no-regret algo. Indeed, using this no-regret view, the authors show that using a no-regret algo (regret decaying as $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yields, for $N$ large enough (of the order of $T$), a worst-case performance linear in $\epsilon$, $u$ and $T$, where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and that the difference in cost-to-go (Q-value) is bounded by $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and the expert is able to recover from a mistake ($u$ is bounded), then the use of a **no-regret online learning algorithm** yields a policy that makes at most (proportionally to) $\epsilon T u$ errors **on its own state distribution (the test distribution)**.

**Requirements**

* An error rate of $\epsilon$ on the training distribution.
* A strongly convex loss function between expert and learner labels (the 0-1 loss is not, but the binary cross-entropy usually used for classification is), so that FTL is indeed a no-regret algo. If the loss is not strongly convex, FTL has no guarantees and we must use another no-regret method.
* The difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any action followed by the expert policy must be small (the expert must be able to recover: this does not account for falling off a cliff, for example!).
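A minimal sketch of the DAgger loop with $\beta_i=I(i=0)$ (a gym-style `env`, a queryable `expert` callable and a generic supervised `train_classifier` are assumptions of this sketch, not the paper's code):

```python
import numpy as np

def dagger(env, expert, train_classifier, n_iters=20, horizon=1000, rng=np.random):
    """Collect states with the current policy, label them with the expert, aggregate, retrain.

    train_classifier: supervised learner returning a policy callable, i.e. the
    best policy in hindsight on the aggregated dataset (the FTL step).
    """
    dataset, policy = [], expert          # beta_0 = 1: the first rollouts follow the expert
    for i in range(n_iters):
        beta = 1.0 if i == 0 else 0.0     # beta_i = I(i == 0)
        obs = env.reset()
        for _ in range(horizon):
            act = expert(obs) if rng.random() < beta else policy(obs)
            dataset.append((obs, expert(obs)))  # label the visited state with the expert action
            obs, _, done, _ = env.step(act)
            if done:
                break
        policy = train_classifier(dataset)
    return policy
```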
**Results**

![](https://i.imgur.com/DZTK2nu.png =500x)

![](https://i.imgur.com/rbtyUbY.png =500x)

![](https://i.imgur.com/ZViGENA.png =500x)

![](https://i.imgur.com/fExmxeb.png =500x)

![](https://i.imgur.com/kv1X01w.png =500x)