AI Safety Gridworlds
Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg
arXiv e-Print archive, 2017
Keywords: cs.LG, cs.AI
First published: 2017/11/27

Abstract: We present a suite of reinforcement learning environments illustrating
various safety properties of intelligent agents. These problems include safe
interruptibility, avoiding side effects, absent supervisor, reward gaming, safe
exploration, as well as robustness to self-modification, distributional shift,
and adversaries. To measure compliance with the intended safe behavior, we
equip each environment with a performance function that is hidden from the
agent. This allows us to categorize AI safety problems into robustness and
specification problems, depending on whether the performance function
corresponds to the observed reward function. We evaluate A2C and Rainbow, two
recent deep reinforcement learning agents, on our environments and show that
they are not able to solve them satisfactorily.
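To make the specification-vs-robustness distinction in the abstract concrete, here is a minimal sketch of the hidden-performance idea: a toy environment returns an observed reward to the agent while separately accumulating a performance score the agent never sees. The environment, its name (ToyBoxPushEnv), and the specific penalties are hypothetical illustrations for this summary, not the authors' released code.

```python
# A minimal sketch (not the authors' released environments): a toy 1-D
# corridor where the observed reward and the hidden performance function
# diverge, illustrating a specification problem.

import random


class ToyBoxPushEnv:
    """Agent walks right toward a goal. Pushing a box aside gives a shortcut
    that the observed reward never penalises, but the hidden performance
    function subtracts a side-effect penalty."""

    def __init__(self, length=5):
        self.length = length
        self.reset()

    def reset(self):
        self.agent_pos = 0
        self.box_displaced = False
        self.hidden_performance = 0.0  # never shown to the agent
        return self.agent_pos

    def step(self, action):
        # action: 0 = move right, 1 = push the box aside (shortcut)
        reward = -1.0  # per-step cost, visible to the agent
        if action == 1 and not self.box_displaced:
            self.box_displaced = True  # irreversible side effect
            self.agent_pos += 2        # shortcut
        else:
            self.agent_pos += 1
        done = self.agent_pos >= self.length
        if done:
            reward += 10.0             # observed goal reward
        # Hidden performance tracks the observed reward but also charges
        # for the side effect the reward function fails to specify.
        self.hidden_performance += reward
        if done and self.box_displaced:
            self.hidden_performance -= 5.0
        return self.agent_pos, reward, done


if __name__ == "__main__":
    env = ToyBoxPushEnv()
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        obs, r, done = env.step(random.choice([0, 1]))
        total_reward += r
    print("observed return:", total_reward)
    print("hidden performance:", env.hidden_performance)
```

A reward-maximizing agent prefers the shortcut even though the hidden performance function penalizes the irreversible side effect; this gap between observed reward and intended behavior is exactly what the suite's hidden performance functions are meant to measure.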