Summaries from arXiv e-Print archive on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Stephane Ross and Geoffrey J. Gordon and J. Andrew Bagnell
arXiv e-Print archive - 2010 via Local arXiv
Keywords: cs.LG, cs.AI, stat.ML
more

[link] Summary by Paul Barde 4 years ago

## General Framework
The imitation learning problem is here cast into a classification problem: label the state with the corresponding expert action. With this, you can see structured prediction (predict next label knowing your previous prediction) as a degenerated IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by expert policy is small (which does not hold if one mistake makes you for fall from a cliff for example!).

## Problem definition
There are three problems that are highlighted in this paper **1-interaction between policy and distribution**, **2-violation of the i.i.d assumption**, and **3-distributional shift and resulting compounding error of BC**. The paper implies that **1** implies **2** that implies **3** but they eventually only address **3** with the algorithm they propose (DAgger).
1. **interaction between policy and distribution**: the fact that the learned policy is present in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex** and this even for convex loss functions. **Yet this also implies non i.i.d. samples!** Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update).
2. **violation of the i.i.d. assumption**: So we just discussed that 1. implies non-iid samples across updates (batches) yet there is another form of dependency (inside a batch this time) as in practice we collect samples using rollouts and in a trajectory s' depends on (s,a) so we have a dependency because we collected the **samples sequentially**.
3. **distribution-shift and compounding error of Behavior cloning (supervised learning)**: During training, you assume iid data and you do not model that the next state you will have to classify (act upon) is influenced by your previous action. But at execution samples are not iid: if you do a mistake you reach a state that you may have never seen (distribution shift) and you cannot recover(compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (CF Levine's class example with the funambulist and the worst-case scenario: as soon as you make a mistake, you cannot recover and you make mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses, they propose a worst-case scenario linear in both** $\epsilon$ and $T$.

**Additionnal assumptions compared to BC:**During training you are allowed to query the expert as much as you want.

## Other methods
* **Forward Training**: trains a **non-stationary policy** (one policy for each time step). Each policy is trained to match the expert on the distribution of states encountered at that time-step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, you cannot stop until you trained all the policies.
* **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, therefore iteratively accounting for the distribution drift, eventually, the distribution should converge. This yields near-linear regret on $T$ and $\epsilon$. **Problem**: At execution time, we sample a policy from the mixture, which can perform poorly on the given state, especially if it corresponded to a different distribution (early iteration for example).

## DAgger -- Dataset Aggregation
**Algo**
![](https://i.imgur.com/FpQAyI2.png =400x)
$\beta_i$ is related to the proportion of state distribution that we want to collect similar to the expert distribution. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in states that are irrelevant. In practice using $\beta_i=I(i=0)$ works well.
Intuitively, by training on generated transitions, DAgger reduces the distribution shift.

**Link to no Regret Online Learning**
DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations).
The use of $\beta_i$ is an "extra trick" that doesn't exist as is in FTL algos.

*Regret*: the difference between what I did and the best thing that I could have done.

Each iteration of DAgger can be seen as a step of Online Learning (treat mini-batches of trajectories under a single policy as a single online-learning example) where the online algo used is FTL but could actually be any no-regret algo. Indeed, by using the no-regret algo view, the authors show that using no-regret algos (regret decays with $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yield that, for N big enough (in the order of T), we have a worst-case performance linear in $\epsilon$, $u$ and $T$. Where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and a difference in the cost-to-go (-Q-value) of $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and that the expert is able to recover from a mistake ($u$ is bounded) then the use of a **no-regret online learning algorithm** will yield a policy that will make at most (proportional to) $\epsilon T u$ errors **on its own state distribution (test distribution)**.

**Requirements**
* An error rate of $\epsilon$ on the training distribution
* A strongly convex loss function between expert and learner labels (0-1 loss is not but binary cross-entropy that is usually used for classification is) so that FTL is indeed a no-regret algo. If the loss in not strongly convex, FTL has no guarantees and we must use another no-regret method
* The difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by the expert policy must be small (expert must be able to recover: this does not account for falling from a cliff for example!).

**Results**

![](https://i.imgur.com/DZTK2nu.png =500x)
![](https://i.imgur.com/rbtyUbY.png =500x)
![](https://i.imgur.com/ZViGENA.png =500x)
![](https://i.imgur.com/fExmxeb.png =500x)
![](https://i.imgur.com/kv1X01w.png =500x)