The paper "Bandits with Unobserved Confounders: A Causal Approach" addresses bandit learning in the presence of unobserved confounding. It assumes that, in the observational setting, the player's decision is influenced by some unobserved context, so the arm the player would naturally choose (his intention) carries information about that context. If we randomize the player's decision, however, this information is lost. The key idea is that, using the available data from both settings, one can infer whether the player's intention should be overruled. Ultimately, this leads to the following strategy: observe the player's intention and then decide whether he should act accordingly or pull the other arm.
The authors show that current MAB algorithms effectively maximize reward under the experimental distribution, which is not optimal in the confounded case. They propose to use the effect of the treatment on the treated (ETT), that is, to compare the average payouts players obtain when acting in accordance with their intuition against those obtained when acting against it. To me, the paper is interesting because it addresses the confounding issue in MAB and proposes a way to estimate some properties of the confounder (related to the casino's payout strategy in the given example) based on the ETT.
At first glance, one might think that the blinking light on the slot machines (B) and the drunkenness of the patron (D) could be either modified or observed in lines 153-159, where we read about a hypothetical attempt to optimize reward using traditional Thompson sampling. If those factors were observable or subject to intervention -- and I'd think they would be, in reality -- then it would be straightforward to do better than the 30% reward rate that's given. The paper eventually makes it clear that both of these variables are unobserved and unalterable. It would help if this were explicit early in the example, or if the cover story were modified to make this aspect more intuitive.
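To make the role of the player's intention concrete, here is a minimal sketch of the kind of calculation I have in mind. Everything in it is my own assumption for illustration rather than something taken from the paper: the payout numbers, the assumption that D and B are independent fair coin flips, the rule that the player's natural choice is D XOR B, and all function names. It computes expected reward exactly (no simulation) for three policies: follow the intention, ignore it and randomize, and always go against it.

```python
from itertools import product

# Hypothetical payout probabilities, keyed by the unobserved context (D, B).
# Each value is (win probability of arm 0, win probability of arm 1).
# These numbers are illustrative assumptions, not quoted from the paper.
PAYOUT = {
    (0, 0): (0.10, 0.50),
    (0, 1): (0.50, 0.10),
    (1, 0): (0.40, 0.20),
    (1, 1): (0.20, 0.40),
}


def intent(d, b):
    """Assumed rule for the arm the player would naturally pull."""
    return d ^ b


def expected_reward(policy):
    """Expected win rate of a policy that only sees the player's intent.

    policy(intent) returns the probability of pulling arm 1.
    D and B are assumed to be independent and uniform on {0, 1}.
    """
    total = 0.0
    for d, b in product((0, 1), repeat=2):
        p_pull_1 = policy(intent(d, b))
        win_0, win_1 = PAYOUT[(d, b)]
        total += 0.25 * ((1 - p_pull_1) * win_0 + p_pull_1 * win_1)
    return total


def payout_given_intent(i, arm):
    """Average win rate of `arm` over the contexts that produce intent `i`."""
    contexts = [c for c in PAYOUT if intent(*c) == i]
    return sum(PAYOUT[c][arm] for c in contexts) / len(contexts)


def follow(i):
    return float(i)        # pull the arm the player intended


def randomize(i):
    return 0.5             # ignore the intent, as a randomized trial does


def oppose(i):
    return float(1 - i)    # always pull the other arm


if __name__ == "__main__":
    for name, policy in [("follow", follow), ("randomize", randomize), ("oppose", oppose)]:
        print(name, round(expected_reward(policy), 3))
    for i in (0, 1):
        print("intent", i, [round(payout_given_intent(i, a), 3) for a in (0, 1)])
```

Under these assumed numbers, following the intention yields about 0.15, randomizing yields about 0.30, and going against the intention yields about 0.45. The per-intent table is the quantity an ETT-style comparison targets: for each intention, it shows which arm pays better on average, even though D and B themselves are never observed.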