Compositional Obverter Communication Learning From Raw Visual Input
Edward Choi
and
Angeliki Lazaridou
and
Nando de Freitas
arXiv e-Print archive - 2018 via Local arXiv
Keywords:
cs.AI, cs.CL, cs.LG, cs.NE
First published: 2018/04/06 (6 years ago) Abstract: One of the distinguishing aspects of human language is its compositionality,
which allows us to describe complex environments with limited vocabulary.
Previously, it has been shown that neural network agents can learn to
communicate in a highly structured, possibly compositional language based on
disentangled input (e.g. hand- engineered features). Humans, however, do not
learn to communicate based on well-summarized features. In this work, we train
neural agents to simultaneously develop visual perception from raw image
pixels, and learn to communicate with a sequence of discrete symbols. The
agents play an image description game where the image contains factors such as
colors and shapes. We train the agents using the obverter technique where an
agent introspects to generate messages that maximize its own understanding.
Through qualitative analysis, visualization and a zero-shot test, we show that
the agents can develop, out of raw image pixels, a language with compositional
properties, given a proper pressure from the environment.
This paper proposes a new training method for multi-agent communication settings. They show the following referential game: A speaker sees an image of a 3d rendered object and describes it to a listener. The listener sees a different image and must decide if it is the same object as described by the speaker (has the same color and shape). The game can only be completed successfully if a communication protocol emerges that can express the color and shape the speaker sees.
The main contribution of the paper is the training algorithm. The speaker enumerates the message that would maximise its own understanding of the message given the image it sees (in a greedy way, symbol by symbol). The listener, given the image and the message, predicts a binary output and is trained using maximum likelihood given the correct answer. Only the listener is updating its parameters - therefore the speaker and listener change roles every number of rounds.
They show that a compositional communication protocol has emerged and evaluate it using zero-shot tests.
[Implemenation of this paper in pytorch](https://github.com/benbogin/obverter)