### Main ideas

**Key problem:** To infer our preferences even though our behavior may systematically diverge from them. Examples: a person who smokes even though they prefer not to (but are unable to quit), or somebody who would like to eat healthily but regularly succumbs to the temptation of donuts (which they consider unhealthy).

**Proposed solution:** Model human biases directly when reasoning about a given agent's behaviour. In the proposed solution, [hyperbolic discounting](https://en.wikipedia.org/wiki/Hyperbolic_discounting) is used to account for our time inconsistency.

### Details

Imagine a grid-world in which an agent moves around the grid to find a place to eat.

![Grid-world example](https://i.imgur.com/dxL8fA1.png)

An agent is a tuple $(p(s), U, Y, k, \alpha)$, where:

* $s \in S$ is a state of the world. It is not described in detail in the paper, but, among other things, it captures facts like: the noodle place is open, the vegetarian place is closed.
* $p(s)$ is the agent's belief about which state of the world it is in, modeled as a probability distribution over states.
* $U$ is the agent's (deterministic) utility function - this is the thing we would most like to learn by observing the agent's actions. $U: S \times A \rightarrow \mathbb{R}$, i.e. we assign utilities to actions $a \in A$ taken in world states $s$.
* The agent chooses actions stochastically, with probability $C(a; s)$ proportional to the exponentiated expected utility: $C(a; s) \propto \exp(\alpha EU_{s}[a])$, or for discounting agents $C(a; s) \propto \exp(\alpha EU_{s,d}[a])$, where $\alpha$ is a noise parameter (the lower it is, the more randomly the agent behaves). Expected utility is defined below.
* $Y$ is a variable that denotes the kind of agent:
    * non-discounting agent - as its name suggests, it does not discount the utility of future actions regardless of the delay, so its expected utility is $EU_s[a] = U(s,a) + \mathbb{E}_{s',a'}[EU_{s'}[a']]$, where $s'$ is the state the agent ends up in after choosing action $a$ from state $s$, and $a'$ is the action chosen in $s'$;
    * discounting naive agent - it discounts the utility of future actions based on the delay $d$: $EU_{s,d}[a] = \frac{1}{1+kd} U(s,a) + \mathbb{E}_{s',a'}[EU_{s',d+1}[a']]$, where $k$ is the discount rate (part of the agent's description; see the tuple above). Because of the discounting, the relative utility of actions changes with time. It is therefore possible that an agent who decided to go to the vegetarian cafe changes its decision once it is next to the donut store, as shown on the left in the image above. If the agent had simply wanted donuts, it could have gone to the closer donut place. That is why it is called the naive agent - it does not take into account that the utilities of its actions will change, and it ends up doing things it didn't plan;
    * discounting sophisticated agent - its expected utility is discounted as in the naive case, but it chooses future actions $a'$ as if the delay $d$ were $0$. In a sense, it knows that its future self will look at the immediate utility of actions rather than at the utilities they have now. Thanks to that it can, for example, choose a different path to the vegetarian restaurant: it knows that its future self would end up in the donut place if it walked next to it.
* $k$ is the discount rate (see the discounting naive agent description).
* $\alpha$ is the noise parameter described above, together with the action-choice rule.

A minimal code sketch of the three agent types is given below.

Given a sequence of actions done by some agent, we want to infer its preferences.
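Before turning to inference, here is a minimal Python sketch of the three agent types under the definitions above. It is not the paper's implementation: the grid-world is collapsed into a tiny deterministic MDP, and all state names, utilities, and parameter values are assumptions made for illustration.

```python
import math

# A minimal sketch (not the paper's implementation) of the three agent types.
# The grid-world is collapsed into a tiny deterministic MDP; every state name,
# utility and parameter value below is an illustrative assumption.

# The short route passes the donut store; the long route avoids it but takes
# one extra step to reach the vegetarian cafe.
TRANSITIONS = {
    ("start", "short_path"): "donut_store",
    ("start", "long_path"): "empty_street",
    ("donut_store", "eat_donut"): "end",
    ("donut_store", "walk_on"): "veg_cafe",
    ("empty_street", "walk_on"): "quiet_corner",
    ("quiet_corner", "walk_on"): "veg_cafe",
    ("veg_cafe", "eat_veg"): "end",
}
UTILITY = {"eat_donut": 3.8, "eat_veg": 10.0}  # hypothetical utilities U(s, a)


def actions(state):
    return [a for (s, a) in TRANSITIONS if s == state]


def discount(d, k):
    return 1.0 / (1.0 + k * d)  # hyperbolic discount factor


def expected_utility(state, action, d, k, alpha, sophisticated):
    """EU_{s,d}[a] = U(s,a)/(1+kd) + E_{s',a'}[EU_{s',d+1}[a']]."""
    eu = discount(d, k) * UTILITY.get(action, 0.0)
    nxt = TRANSITIONS[(state, action)]
    if not actions(nxt):  # terminal state: nothing follows
        return eu
    # The naive agent predicts its future self choosing at delay d+1 (it assumes
    # the current plan will be followed); the sophisticated agent predicts the
    # choice at delay 0, which is how the future self will actually choose.
    choice_delay = 0 if sophisticated else d + 1
    probs = choice_probs(nxt, choice_delay, k, alpha, sophisticated)
    return eu + sum(p * expected_utility(nxt, a2, d + 1, k, alpha, sophisticated)
                    for a2, p in probs.items())


def choice_probs(state, d, k, alpha, sophisticated):
    """C(a; s) is proportional to exp(alpha * EU_{s,d}[a])."""
    acts = actions(state)
    eus = [expected_utility(state, a, d, k, alpha, sophisticated) for a in acts]
    top = max(eus)
    weights = [math.exp(alpha * (eu - top)) for eu in eus]  # subtract max for stability
    z = sum(weights)
    return {a: w / z for a, w in zip(acts, weights)}


if __name__ == "__main__":
    k, alpha = 2.0, 15.0
    for label, soph in [("naive", False), ("sophisticated", True)]:
        print(label, "at start:      ", choice_probs("start", 0, k, alpha, soph))
        print(label, "at donut store:", choice_probs("donut_store", 0, k, alpha, soph))
    # Setting k = 0 recovers the non-discounting agent.
```

With these illustrative numbers the printout shows the intended contrast: the naive agent picks the short path (its plan, made at a delay, favours walking past the donut store) and then eats the donut once the delay is zero, while the sophisticated agent, predicting exactly that, prefers the long path. Setting $k = 0$ recovers the non-discounting agent.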
In the paper this is translated into: given a sequence of actions done by some agent, update your probability distribution over agent tuples. We start with a uniform distribution (zero knowledge) and do Bayesian updates with consecutive actions (a minimal sketch of such an update is given at the end of this summary). The model described above is considered good if, after the updates, the probability mass concentrates on the kinds of agents that a human would infer after seeing the same actions.

### Results

According to the experiments the model performs well, which means that it assigns high probabilities to the kinds of agents that humans describe after seeing the same actions. For example, after seeing the actions from the image above, both the model and human subjects rate highly explanations of giving in to temptation (in the case of a naive planner) or avoiding temptation (in the case of a sophisticated one). The result holds for more complex scenarios:

* inference under uncertainty - the agent might have inaccurate beliefs, for example it might 'think' that the noodle place is open when in fact it's closed,
* inference from multiple episodes - even though in two out of three episodes the agent chooses donuts, both human subjects and the model assign high probability to the case where the vegetarian place is preferred (and more generally they agree over a variety of explanations).

**Conclusion:** If we want to be able to delegate some of our decisions to AI systems, then it is necessary that they are able to learn our preferences despite inconsistencies in our behaviour. The result presented in the paper shows that modeling our biases directly is a feasible direction of research.
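As mentioned above, here is a minimal sketch of the Bayesian update over a small discrete hypothesis space of agent tuples. It continues the Python sketch from the Details section (reusing `UTILITY` and `choice_probs` defined there). The hypothesis grid, the candidate values, and the assumption that $\alpha$ is known are simplifications for illustration; the paper's inference also handles inaccurate beliefs $p(s)$, which this sketch ignores.

```python
import itertools

# Continues the sketch above and reuses its UTILITY dict and choice_probs().
# The hypothesis grid below (candidate donut utilities, discount rates, agent
# types) is an illustrative assumption, not the paper's actual hypothesis space.
U_DONUT = [2.0, 3.8, 12.0]    # how much might the agent value the donut?
KS = [0.0, 2.0]               # k = 0 is the non-discounting agent
AGENT_TYPES = [False, True]   # False = naive, True = sophisticated
ALPHA = 15.0                  # noise parameter assumed known here for simplicity


def likelihood(observations, u_donut, k, sophisticated):
    """P(observed (state, action) pairs | one agent tuple)."""
    UTILITY["eat_donut"] = u_donut  # plug the hypothesised utility into the model
    p = 1.0
    for state, action in observations:
        p *= choice_probs(state, 0, k, ALPHA, sophisticated)[action]
    return p


def posterior(observations):
    """Uniform prior over the grid, updated by the likelihood of each action."""
    hypotheses = list(itertools.product(U_DONUT, KS, AGENT_TYPES))
    prior = 1.0 / len(hypotheses)
    weights = {h: prior * likelihood(observations, *h) for h in hypotheses}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}


# The episode from the left of the figure: the agent heads along the short path
# towards the vegetarian cafe, then eats a donut once the store is right there.
episode = [("start", "short_path"), ("donut_store", "eat_donut")]
for (u_d, k, soph), p in sorted(posterior(episode).items(), key=lambda kv: -kv[1]):
    kind = "sophisticated" if soph else "naive"
    print(f"U(donut)={u_d:>4}  k={k}  {kind:>13}  p={p:.3f}")
```

With this single episode the posterior mass spreads over agents that either genuinely prefer donuts or are naive discounters who gave in to temptation, while a sophisticated discounter with only a mild donut preference gets little mass - mirroring the point above that several explanations remain plausible until more episodes are observed.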