[link]
## General Framework

The imitation learning (IL) problem is here cast as a classification problem: label each state with the corresponding expert action. From this angle, structured prediction (predict the next label given your previous predictions) can be seen as a degenerate IL problem. The paper makes the **reduction assumption** that the probability of a mistake $\epsilon$ can be made as small as desired on the **training distribution** (expert or mixture). It also assumes that the difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any other action followed by the expert policy is small (which does not hold if one mistake makes you fall off a cliff, for example!).

## Problem definition

Three problems are highlighted in this paper: **1. interaction between policy and distribution**, **2. violation of the i.i.d. assumption**, and **3. distribution shift and the resulting compounding error of behavior cloning (BC)**. The paper implies that **1** implies **2**, which implies **3**, but the proposed algorithm (DAgger) eventually only addresses **3**.

1. **Interaction between policy and distribution**: the learned policy appears both in the expectation (through the state-action distribution) and in the loss, which makes the **objective non-convex**, even for convex loss functions. **Yet this also implies non-i.i.d. samples!** Indeed, even if one could sample directly from the state-action distribution (e.g. given its analytical form, or an infinite experience replay buffer) and thus draw i.i.d. samples, a dependency would still arise across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample, so my next sample depends on my previous one (since it conditioned my policy update).
2. **Violation of the i.i.d. assumption**: so we just discussed that 1.
implies non-i.i.d. samples across updates (batches), yet there is another form of dependency (within a batch this time): in practice we collect samples via rollouts, and within a trajectory $s'$ depends on $(s, a)$, so there is a dependency because the **samples were collected sequentially**.
3. **Distribution shift and compounding error of behavior cloning (supervised learning)**: during training you assume i.i.d. data and do not model the fact that the next state you will have to classify (act upon) is influenced by your previous action. But at execution time, samples are not i.i.d.: if you make a mistake, you may reach a state you have never seen (distribution shift) and from which you cannot recover (compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (cf. Levine's class example with the tightrope walker and the worst-case scenario: as soon as you make a mistake, you cannot recover, and you keep making mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses: they propose worst-case bounds linear in both** $\epsilon$ **and** $T$.

**Additional assumption compared to BC:** during training you are allowed to query the expert as much as you want.

## Other methods

* **Forward Training**: trains a **non-stationary policy** (one policy per time step). Each policy is trained to match the expert on the distribution of states encountered at that time step when rolling out the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, and you cannot stop until all of them are trained.
* **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**.
Each new policy is trained to match the expert on the state distribution induced by the current mixture, thereby iteratively accounting for the distribution drift; eventually, the distribution should converge. This yields near-linear regret in $T$ and $\epsilon$. **Problem**: at execution time we sample a policy from the mixture, and it can perform poorly on the given state, especially if it was trained on a different distribution (an early iteration, for example).

## DAgger: Dataset Aggregation

**Algo**

![](https://i.imgur.com/FpQAyI2.png =400x)

$\beta_i$ is related to the proportion of the collected state distribution that we want to be similar to the expert distribution. This may help in early iterations, when the policy is bad and might otherwise spend a lot of time collecting data points in irrelevant states. In practice, $\beta_i = I(i = 0)$ works well. Intuitively, by training on the transitions its own rollouts generate, DAgger reduces the distribution shift.

**Link to no-regret online learning**

DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e., with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an extra trick that does not exist as such in FTL algorithms.

*Regret*: the difference between what I did and the best thing I could have done.

Each iteration of DAgger can be seen as a step of online learning (treat a mini-batch of trajectories under a single policy as a single online-learning example), where the online algorithm used is FTL but could actually be any no-regret algorithm. Indeed, using this no-regret view, the authors show that using a no-regret algorithm (average regret decaying as $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yields, for $N$ large enough (on the order of $T$), a worst-case performance linear in $\epsilon$, $u$, and $T$.
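Writing $J(\pi)$ for the expected total cost over a horizon of $T$ (notation added here for the comparison, not used above) and omitting constants and finite-sample terms, the two worst-case guarantees discussed in these notes compare roughly as:

$$J(\hat{\pi}_{\mathrm{BC}}) \le J(\pi^*) + \epsilon T^2 \qquad \text{vs.} \qquad J(\hat{\pi}_{\mathrm{DAgger}}) \le J(\pi^*) + u\,\epsilon\, T + O(1)$$

where $\pi^*$ is the expert, $\epsilon$ the classification error rate on the training distribution, and $u$ the bound on the cost-to-go gap.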
Here it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and that the difference in cost-to-go (Q-value) is bounded by $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and the expert is able to recover from a mistake ($u$ is bounded), then using a **no-regret online learning algorithm** yields a policy that makes at most (proportionally to) $\epsilon T u$ errors **on its own state distribution (the test distribution)**.

**Requirements**

* An error rate of $\epsilon$ on the training distribution.
* A strongly convex loss function between expert and learner labels, so that FTL is indeed a no-regret algorithm (the 0-1 loss is not, but the binary cross-entropy usually used for classification is). If the loss is not strongly convex, FTL has no guarantees and another no-regret method must be used.
* The difference in cost-to-go (Q-value) between the expert action followed by the expert policy and any other action followed by the expert policy must be small (the expert must be able to recover: this does not account for falling off a cliff, for example!).

**Results**

![](https://i.imgur.com/DZTK2nu.png =500x)
![](https://i.imgur.com/rbtyUbY.png =500x)
![](https://i.imgur.com/ZViGENA.png =500x)
![](https://i.imgur.com/fExmxeb.png =500x)
![](https://i.imgur.com/kv1X01w.png =500x)
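To make the DAgger loop concrete, here is a minimal runnable sketch in a toy chain environment. The environment, expert, and table classifier are invented for illustration and are not from the paper; the retraining on all aggregated data is the FTL ("best in hindsight") step, and the rollout mixture uses $\beta_i = I(i = 0)$ as suggested above.

```python
import random

random.seed(0)

T = 20        # episode horizon
N_ITER = 5    # DAgger iterations
N_STATES = 5  # toy chain environment: state = position in {0, ..., 4}

def expert_action(s):
    # Hypothetical expert: move toward state 0 (action 0), stay once there (action 1).
    return 0 if s > 0 else 1

def step(s, a):
    # Toy dynamics: action 0 moves down the chain, action 1 drifts up.
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

def train(dataset):
    # FTL step: best policy in hindsight on ALL transitions seen so far,
    # here a table policy taking the majority expert label per state.
    table = {}
    for s in range(N_STATES):
        labels = [a for (si, a) in dataset if si == s]
        if labels:
            table[s] = max(set(labels), key=labels.count)
    return lambda s: table.get(s, 1)  # default action for unseen states

dataset = []
learner = lambda s: 1  # initial (bad) policy: always drift up the chain
for i in range(N_ITER):
    beta = 1.0 if i == 0 else 0.0  # beta_i = I(i = 0)
    s = N_STATES - 1
    for _ in range(T):
        # Mixture policy for data collection: expert w.p. beta, learner otherwise.
        a_exec = expert_action(s) if random.random() < beta else learner(s)
        # Regardless of which action is executed, label the visited state
        # with the EXPERT's action (this is the expert query).
        dataset.append((s, expert_action(s)))
        s = step(s, a_exec)
    learner = train(dataset)  # retrain on the aggregated dataset
```

Note that the dataset is always labeled with the expert's action while the rollouts are (mostly) driven by the learner: this is what lets DAgger gather expert labels on the learner's own state distribution rather than only on the expert's.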