Towards Deep Symbolic Reinforcement Learning
Marta Garnelo, Kai Arulkumaran, and Murray Shanahan
arXiv e-Print archive - 2016
Keywords:
cs.AI, cs.LG
First published: 2016/09/18
Abstract: Deep reinforcement learning (DRL) brings the power of deep neural networks to
bear on the generic task of trial-and-error learning, and its effectiveness has
been convincingly demonstrated on tasks such as Atari video games and the game
of Go. However, contemporary DRL systems inherit a number of shortcomings from
the current generation of deep learning techniques. For example, they require
very large datasets to work effectively, entailing that they are slow to learn
even when such datasets are available. Moreover, they lack the ability to
reason on an abstract level, which makes it difficult to implement high-level
cognitive functions such as transfer learning, analogical reasoning, and
hypothesis-based reasoning. Finally, their operation is largely opaque to
humans, rendering them unsuitable for domains in which verifiability is
important. In this paper, we propose an end-to-end reinforcement learning
architecture comprising a neural back end and a symbolic front end with the
potential to overcome each of these shortcomings. As proof-of-concept, we
present a preliminary implementation of the architecture and apply it to
several variants of a simple video game. We show that the resulting system --
though just a prototype -- learns effectively, and, by acquiring a set of
symbolic rules that are easily comprehensible to humans, dramatically
outperforms a conventional, fully neural DRL system on a stochastic variant of
the game.
DRL has many disadvantages: large data requirements, slow learning, difficult interpretation, difficult transfer, no notion of causality, and analogical reasoning carried out at a statistical rather than an abstract level. These can be addressed by adding a symbolic front end on top of the DL layer before feeding its output to the RL agent. A symbolic front end offers a smaller, more generalizable state space, flexible predicate length, and easier composition of predicate expressions, while DL avoids the manual feature engineering that purely symbolic reasoning requires. Hence DL combined with symbolic reasoning may be the way to progress towards AGI. State space reduction in the symbolic representation is achieved by describing states through object interactions (object positions and object types). Although this relies on assumptions such as objects of the same type behaving similarly, it makes causal relations between actions, object interactions, and reward easier to understand.
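As a rough illustration of that state representation, the sketch below encodes a frame as a set of (type-pair, relative-position) interactions. The names `ObjectRep` and `build_interactions` and the distance threshold are assumptions for illustration, not the paper's actual code.

```python
from itertools import combinations

# A detected object: its symbolic type (derived from the CNN activation
# spectrum) and its (x, y) position in the frame. Names are illustrative.
class ObjectRep:
    def __init__(self, obj_type, x, y):
        self.obj_type = obj_type
        self.x = x
        self.y = y

def build_interactions(objects, max_dist=4):
    """Turn detected objects into symbolic interactions: one entry per pair
    of nearby objects, keyed by their type pair and described by their
    relative position (which captures object dynamics)."""
    interactions = []
    for a, b in combinations(objects, 2):
        # Order each pair consistently by type so the relative position
        # is always measured the same way round.
        if a.obj_type > b.obj_type:
            a, b = b, a
        dx, dy = b.x - a.x, b.y - a.y
        if abs(dx) <= max_dist and abs(dy) <= max_dist:
            interactions.append(((a.obj_type, b.obj_type), (dx, dy)))
    return interactions

# Example: an agent of type 1 two cells left of an object of type 2
frame = [ObjectRep(1, 3, 3), ObjectRep(2, 5, 3)]
print(build_interactions(frame))  # [((1, 2), (2, 0))]
```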
Broadly, the pipeline consists of:
(1) CNN layer: raw pixels to a learned representation.
(2) Salient pixel identification: pixels whose CNN activations exceed a certain threshold.
(3) Object typing: objects of the same kind are identified from the activation spectra of salient pixels.
(4) Object tracking: matching objects are found across consecutive time steps using spatial closeness (objects can move only a small distance between frames) and similar neighbours (different object types can sit close to each other, so spatial closeness alone cannot identify matching objects).
(5) Building symbolic interactions: relative object positions are recorded for all pairs of objects within a certain maximal distance. Relative position is necessary to capture object dynamics, and the maximal-distance threshold makes learning quicker even though it may lead to a locally optimal policy.
(6) RL agent: object interactions are used as states in the Q-learning update. Instead of treating all object interactions in a frame as one state, the number of states is reduced further by assuming interactions between two object types are independent of other types and running a separate Q-learning update for each type pair. Intuitively, a frame is viewed as a set of independent object-type interactions. The action chosen in a state is then the one that maximizes the sum of Q-values across all type pairs, as in the sketch below.
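A minimal sketch of the per-type-pair update and the summed action selection, assuming a tabular Q-function per type pair; the table layout, epsilon-greedy exploration, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import random

ACTIONS = [0, 1, 2, 3]            # e.g. up, down, left, right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# One tabular Q-function per object-type pair:
# q_tables[type_pair][(relative_position, action)] -> Q-value
q_tables = defaultdict(lambda: defaultdict(float))

def choose_action(interactions):
    """Epsilon-greedy choice; the greedy action maximizes the SUM of
    Q-values across all type-pair interactions present in the frame."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    scores = {a: sum(q_tables[tp][(rel, a)] for tp, rel in interactions)
              for a in ACTIONS}
    return max(scores, key=scores.get)

def update(interactions, action, reward, next_interactions):
    """Independent Q-learning update for each type-pair interaction."""
    for tp, rel in interactions:
        # Best achievable Q-value for this type pair in the next frame
        next_rels = [r for t, r in next_interactions if t == tp]
        best_next = max((q_tables[tp][(r, a)] for r in next_rels
                         for a in ACTIONS), default=0.0)
        td_target = reward + GAMMA * best_next
        q_tables[tp][(rel, action)] += ALPHA * (td_target - q_tables[tp][(rel, action)])
```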
The results claim that, with DRL plus symbolic reasoning, policy transfer can be observed: a system first trained on an evenly spaced grid world and then applied to a randomly spaced grid world achieves performance close to 70%, whereas DQN reaches only about 50% even after training for 1000 epochs with an epoch length of 100.