Automatically learn which Active Learning strategy to use.
They use the multi-armed bandit framework where each arm is an Active Learning strategy.
The core RL algorithm used is [EXP4.P](https://arxiv.org/abs/1002.4058), which is itself based on EXP4 (**Exp**onential weighting for **Exp**loration and **Exp**loitation with **Exp**erts). They make only slight adjustments to the reward function.
[![screen shot 2017-06-14 at 7 33 46 pm](https://user-images.githubusercontent.com/17261080/27146101-6d8392b4-5138-11e7-8e12-5617b258ddfa.png)](https://user-images.githubusercontent.com/17261080/27146101-6d8392b4-5138-11e7-8e12-5617b258ddfa.png)
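To make the "each arm is an Active Learning strategy" idea concrete, here is a minimal sketch of an exponential-weighting bandit over strategies. It is a simplified EXP3-style update, not the EXP4.P algorithm from the paper, and it leaves out the paper's reward adjustments. The names `StrategyBandit` and `run_simulation`, the exploration rate `gamma`, and the simulated Bernoulli rewards are all assumptions made for illustration; in a real setup the reward would come from how much the queried label helped the model.

```python
import math
import random


class StrategyBandit:
    """EXP3-style bandit where each arm stands for one active learning strategy."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma  # exploration rate (assumed value, not from the paper)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the normalized weights with a uniform distribution for exploration.
        total = sum(self.weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select(self):
        # Sample a strategy index and return it with its sampling probability.
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate: only the chosen arm is observed.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale so the largest weight is 1, which avoids overflow on long runs.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


def run_simulation(mean_rewards, rounds=5000, seed=0):
    """Simulate strategies with fixed (hypothetical) mean rewards in [0, 1]."""
    bandit = StrategyBandit(len(mean_rewards), seed=seed)
    env = random.Random(seed + 1)  # separate RNG for the simulated rewards
    for _ in range(rounds):
        arm, prob = bandit.select()
        # Bernoulli reward as a stand-in for the benefit of the queried label.
        reward = 1.0 if env.random() < mean_rewards[arm] else 0.0
        bandit.update(arm, prob, reward)
    return bandit.probabilities()


if __name__ == "__main__":
    final_probs = run_simulation([0.1, 0.4, 0.9])
    print([round(p, 3) for p in final_probs])
```

Over many rounds the sampling probability concentrates on the strategy with the highest average reward, while the uniform mixing term keeps every strategy occasionally explored.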
It beats all other techniques most of the time and ensures that, in the long run, the best strategy is used.