This paper addresses the problem of inverse reinforcement learning when the agent can change its objective during the recording of trajectories. This results in transitions between several reward functions, each of which explains the observed agent's trajectory only locally. The transition probabilities between reward functions are unknown. The authors propose a cascade of EM and Viterbi algorithms to discover the reward functions and the segments on which they are valid.
Their algorithm maximizes the log-likelihood of the expert's demonstrated trajectories with respect to the parameters, namely the initial distributions over states and rewards, the local reward functions, and the transition function between rewards. To do so, they use the expectation-maximization (EM) method. Then, via the Viterbi algorithm, they partition the trajectories into segments with locally consistent rewards.
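For concreteness, here is a minimal sketch of the structure I describe above, as I understand it: the active reward index is treated as the hidden state of an HMM whose emission at each step is the Boltzmann likelihood of the expert's action under that reward. Everything below (the tiny random MDP, the features, the temperature beta, the uniform reward-transition matrix, and all names) is my own illustration and not the authors' code; the M-step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K, gamma, beta = 5, 2, 2, 0.9, 5.0            # states, actions, reward modes
P = rng.dirichlet(np.ones(S), size=(S, A))           # known dynamics: P[s, a, s']
phi = rng.normal(size=(S, 3))                        # state features
theta = rng.normal(size=(K, 3))                      # one reward parameter vector per mode

def soft_policy(th):
    """Boltzmann policy from soft value iteration under the reward phi @ th."""
    r, V = phi @ th, np.zeros(S)
    for _ in range(200):
        Q = r[:, None] + gamma * (P @ V)             # Q[s, a]
        m = Q.max(axis=1)                            # log-sum-exp for stability
        V = m + np.log(np.exp(beta * (Q - m[:, None])).sum(axis=1)) / beta
    pi = np.exp(beta * (Q - V[:, None]))             # softmax over actions
    return pi / pi.sum(axis=1, keepdims=True)

# Expert trajectory (random here, purely for illustration) and HMM parameters.
traj = [(rng.integers(S), rng.integers(A)) for _ in range(20)]
trans = np.full((K, K), 1.0 / K)                     # reward-transition matrix
pis = [soft_policy(theta[k]) for k in range(K)]
emit = np.array([[pis[k][s, a] for k in range(K)] for s, a in traj])  # emit[t, k]

# E-step: only the forward pass is shown as a hint; a full E-step also needs the
# backward pass, and the M-step (omitted) would re-fit theta and trans by
# gradient ascent on the expected complete-data log-likelihood.
alpha = np.zeros((len(traj), K)); alpha[0] = emit[0] / K
for t in range(1, len(traj)):
    alpha[t] = emit[t] * (alpha[t - 1] @ trans)
    alpha[t] /= alpha[t].sum()

# Viterbi decoding segments the trajectory into locally consistent rewards.
delta = np.log(emit[0] / K)
back = np.zeros((len(traj), K), dtype=int)
for t in range(1, len(traj)):
    scores = delta[:, None] + np.log(trans)          # scores[prev, cur]
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + np.log(emit[t])
seg = [int(delta.argmax())]
for t in range(len(traj) - 1, 0, -1):
    seg.append(int(back[t, seg[-1]]))
print(seg[::-1])                                     # decoded reward index per time step
```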
Strengths of the paper:
1. The authors leverage existing and classical methods from the machine learning and optimization fields, such as EM, Viterbi, value iteration and gradient ascent, to build their algorithm. This will allow the community to reproduce their results easily.
2. The experiments are conducted on both synthetic and real-world data. They compare their method to MLIRL, which does not use locally consistent rewards and which is the canonical baseline, since their algorithm is a generalization of MLIRL. The results presented show the superiority of their method over MLIRL.
3. To the best of my knowledge, the idea presented by the authors is original.
Weaknesses of the paper:
1. The paper is very dense (the figures are embedded in the text), which makes it difficult to read.
2. The proposed algorithm requires knowledge of the dynamics and of the number of reward functions. As future work, the authors plan to extend their algorithm to an unknown number of rewards, but they do not mention removing the requirement of known dynamics. Could the authors comment on this, given that some IRL algorithms do not need perfect knowledge of the dynamics?
3. The method needs to solve MDPs repeatedly when learning the reward functions: an MDP must be solved for each value of theta visited during gradient ascent. Is this prohibitive for very large MDPs? Is there a way to avoid that step? The action-value function Q is defined via a softmax operator in order to obtain a differentiable policy; does this allow the MDP to be solved more efficiently (see the recursion sketched below)?
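For reference, the soft Bellman recursion I have in mind (my notation, with $\beta$ the Boltzmann temperature; standard in MLIRL-style methods) is

$$ Q_\theta(s,a) = R_\theta(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_\theta(s'), \qquad V_\theta(s) = \tfrac{1}{\beta} \log \sum_a \exp\!\big(\beta\, Q_\theta(s,a)\big), \qquad \pi_\theta(a \mid s) \propto \exp\!\big(\beta\, Q_\theta(s,a)\big). $$

My understanding is that the softmax mainly makes the policy differentiable in $\theta$; each evaluation still requires iterating this recursion to convergence, so it is not obvious that it reduces the cost of the inner MDP solves.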
4. The authors use gradient ascent within the EM method; could they comment on the concavity of their criterion?
5. In the experiments (gridworlds), the number of state features is very small, so it is understandable that a reward that is linear in the features performs badly. Have the authors considered comparing their method to an IRL method that uses a richer set of state features? This is my main concern with the experiments: the features used are not expressive enough for a classical IRL method to be a fair baseline, which may explain both why MLIRL performs badly and why its performance does not improve as the number of expert trajectories grows.
6. The performance is measured by the average log-likelihood of the expert's demonstrated trajectories, which is the very criterion the algorithm maximizes. I think a more pertinent measure would be the value of the policy produced by optimizing the reward recovered by the algorithm (see the metric sketched below). Could the authors comment on that and explain why their performance metric is more appropriate?
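Concretely, the kind of metric I have in mind is the expected value difference commonly used in IRL evaluations (notation mine):

$$ \mathrm{EVD} \;=\; V^{\pi^*}_{R}(s_0) \;-\; V^{\hat{\pi}}_{R}(s_0), $$

where $R$ is the true reward, $\pi^*$ is optimal under $R$, and $\hat{\pi}$ is the policy obtained by planning with the reward recovered by the algorithm. In the locally consistent setting, this could for instance be computed per segment, using the reward decoded by Viterbi on that segment.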