## General Framework

Really **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful in imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so they improve on DAgger's bound of $\epsilon_{class} u T$, where $u$ is the difference in cost-to-go (between the expert and one error followed by the expert policy), and obtain $\epsilon_{class} T$, where $\epsilon_{class}$ is the error due to the limited expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go, they remove the cost-to-go contribution from the bound (on the difference in performance between the learned policy and the expert policy) and thus obtain a tighter bound.

In the paper they use the word "regret" for two distinct concepts, which is confusing to me: one refers to the no-regret online learning meta-approach to IL (as in DAgger), the other comes from the fact that Aggrevate aims at minimizing the cost-to-go difference with the expert (the sum of the costs I incurred because I did not behave like the expert = *regret*), whereas DAgger aims at minimizing the error rate w.r.t. the expert. Additionally, the paper extends the view of imitation learning as an online learning procedure to Reinforcement Learning.

## Assumptions

**Interactive**: you can re-query the expert and thus reach $\epsilon T$ bounds instead of the $\epsilon T^2$ of non-interactive methods (Behavioral Cloning), which suffer from compounding errors. One also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss w.r.t. the expert, for example), since the cost-to-go is computed under the expert policy (which would always yield zero cost).

## Other methods

**SEARN**: also reasons about the **cost-to-go, but under the current policy** instead of the expert's (even though in practice you can use the expert's, which makes it really similar to Aggrevate). SEARN uses **stochastic policies** and can be seen as an Aggrevate variant where stochastic mixing is used to solve the online learning problem instead of **Regularized-Follow-The-Leader (R-FTL)**.

## Aggrevate - IL with cost-to-go

![](https://i.imgur.com/I1otJwV.png)

Pretty much like DAgger, but one has to use a no-regret online learning algorithm to do **cost-sensitive** classification instead of regular classification. In the paper, they use the R-FTL algorithm and train the policy on the data from all previous iterations. Indeed, using R-FTL with a strongly convex loss (like the squared error) and a stable batch learner (like stochastic gradient descent) ensures the no-regret property.

In practice (to deal with infinite policy classes and knowing the cost of only a few actions per state), they reduce cost-sensitive classification to an **argmax regression problem**: they train a model to predict the cost-to-go given state and action (and time, if we want non-stationary policies) using the collected datapoints and the (strongly convex) squared error loss. Then they argmin this model to pick the action that minimizes the cost-to-go (cost-sensitive classification). This is close to what we do in **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD-error (instead of full-rollout costs-to-go under the expert), and argmax the critic to get the policy.
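To make the reduction concrete, here is a minimal sketch of the Aggrevate loop as I understand it, using scikit-learn's `SGDRegressor` as the stable squared-loss learner. The helpers `rollout`, `rollout_cost_to_go`, `featurize` and the `env`/`expert` interface are hypothetical placeholders, not from the paper.

```python
# A minimal sketch, not the paper's implementation: `rollout`,
# `rollout_cost_to_go`, `featurize` and the env/expert interface are assumed.
import numpy as np
from sklearn.linear_model import SGDRegressor  # squared loss (strongly convex) + SGD (stable learner)

def aggrevate(env, expert, n_iters, horizon, n_rollouts_per_iter):
    data = []                 # aggregated (features(s, a, t), cost-to-go) pairs
    q_model = SGDRegressor()  # regresses the expert's cost-to-go
    policy = expert           # start by rolling in with the expert

    for _ in range(n_iters):
        for _ in range(n_rollouts_per_iter):
            t = np.random.randint(horizon)        # uniformly pick a switching time
            s = rollout(env, policy, steps=t)     # roll in with the current policy up to t
            a = env.sample_action()               # uniform exploration over actions
            # take `a` once, then let the EXPERT finish: sampled expert cost-to-go
            q = rollout_cost_to_go(env, s, a, expert, horizon - t)
            data.append((featurize(s, a, t), q))

        X, y = map(np.array, zip(*data))
        q_model.fit(X, y)                         # R-FTL: refit on ALL data gathered so far
        # cost-sensitive classification = argmin over actions of the regressed cost-to-go
        policy = lambda s, t: min(env.actions,
                                  key=lambda a: q_model.predict([featurize(s, a, t)])[0])
    return policy
```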
Similarly to DQN, the way you explore the actions for which you compute the cost-to-go matters (in this paper they use uniform exploration).

**Limitations**: if the policy class is not expressive enough to match the expert's performance, this algorithm may fail to learn a reasonable policy. Example: the task is to go from point A to point B, and there exist a narrow shortcut and a safer but longer road. The expert can handle both roads, so it prefers taking the shortcut. Even if the learned policy class can handle the safer road (but not the shortcut), the learner will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, ignoring the fact that they cannot be achieved by any policy in the learned policy class.

## RL via No-Regret Policy Iteration -- NRPI

![](https://i.imgur.com/X4ckv1u.png)

NRPI does not require an expert policy anymore, only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's, since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the equivalent argmax regression problem is really similar to Q-learning (using sampled costs-to-go from rollouts instead of Bellman errors), except that **the costs-to-go** in the aggregate dataset correspond to **outdated policies!** (in contrast, DQN's replay data consists of rewards rather than costs-to-go). Yet, since R-FTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations, and both the policies and the costs-to-go converge.

The performance of NRPI is strongly limited by the quality of the exploration distribution. But if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for non-interactive, state-only IL (if you have access to a reward).

## General limitations

Both methods are much less sample efficient than DAgger as they require costs-to-go: one full rollout for a single data-point.

## Broad contribution

Seeing iterative learning methods such as Q-learning in the light of online learning is insightful and yields better bounds and a better understanding of why some methods work. It provides a good tool to analyze the dynamics that interleave learning and execution (optimizing and collecting data) for the purpose of generalization. For example, the bound for NRPI can seem quite counter-intuitive to someone used to the on-policy/off-policy distinction: NRPI optimizes a policy w.r.t. the **costs-to-go of other policies**, yet R-FTL tells us that it converges towards what we want. Additionally, it may give a practical advantage in terms of stability, as the policy is optimized with larger batches and thus has to be good across many states and many cost-to-go formulations.
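To make this last point concrete, here is the previous sketch adapted to NRPI: the regression targets in the aggregate dataset come from rollouts of earlier (outdated) policies, yet each iteration refits on all of them (R-FTL). The same hypothetical helpers (`rollout_cost_to_go`, `featurize`, the `env` interface) and a placeholder `state_distribution` object are assumed; this is only an illustration of the idea, not the paper's exact algorithm.

```python
# Minimal NRPI-style sketch under the same assumed interface as the Aggrevate sketch above.
import numpy as np
from sklearn.linear_model import SGDRegressor

def nrpi(env, state_distribution, n_iters, horizon, n_rollouts_per_iter):
    data = []                 # aggregated (features(s, a, t), cost-to-go) pairs
    q_model = SGDRegressor()
    policy = lambda s, t: np.random.choice(env.actions)   # arbitrary initial policy

    for _ in range(n_iters):
        for _ in range(n_rollouts_per_iter):
            t = np.random.randint(horizon)
            s = state_distribution.sample(t)   # exploration distribution replaces the expert
            a = env.sample_action()            # uniform action exploration
            # cost-to-go of the CURRENT policy; at later iterations these targets are "outdated"
            q = rollout_cost_to_go(env, s, a, policy, horizon - t)
            data.append((featurize(s, a, t), q))

        X, y = map(np.array, zip(*data))
        q_model.fit(X, y)     # refit against the costs-to-go of ALL previous policies (R-FTL)
        policy = lambda s, t: min(env.actions,
                                  key=lambda a: q_model.predict([featurize(s, a, t)])[0])
    return policy
```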