Summary by CodyWild 5 years ago
Language is obviously useful to humans for coordinating on complicated tasks, and, the logic goes, if you give agents in a multi-agent RL system some shared interest and the capacity to communicate, you might expect them to use that communication channel to coordinate their actions. This is particularly true when some part of the environment is visible to only one of the agents. A number of papers in the field have set up such scenarios and argued that meaningful communication strategies developed, mostly in the form of one agent sending a message signaling its planned action to the other agent before both act. This paper tries to tease apart the various quantitative metrics used to evaluate whether informative messages are being sent, and to explain why they can diverge from each other in unintuitive ways. The experiments in the paper are done in quite simple environments: one-shot actions, a payoff matrix, and the ability for agents to send messages before acting.
Some metrics identified by the paper are:
- Speaker Consistency: There’s high mutual information between the message a speaker sends and the action it takes. Said another way, you could use a speaker’s message to predict its action at a rate better than chance, because the message contains information about the action
- Heightened reward/task completion under communication: Fairly straightforward, this metric argues that informative communication is happening when pairs of agents do better with a communication channel available than without one
- Instantaneous coordination: Measures the mutual information between the message sent by agent A and the action of agent B, in a similar way to Speaker Consistency (see the sketch after this list for how both mutual-information metrics might be estimated).
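To make the two mutual-information metrics concrete, here is a minimal sketch of how they could be estimated from logged rollouts of discrete messages and actions. The paper doesn't prescribe this exact procedure; the variable names and toy data below are purely illustrative.

```python
import numpy as np
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    p_x = Counter(xs)
    p_y = Counter(ys)
    p_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c_xy in p_xy.items():
        p_joint = c_xy / n
        p_marginal = (p_x[x] / n) * (p_y[y] / n)
        mi += p_joint * np.log(p_joint / p_marginal)
    return mi

# Hypothetical logged rollouts, one entry per step:
# speaker_msgs[t]   - message sent by agent A
# speaker_acts[t]   - action taken by agent A at the same step
# listener_acts[t]  - action taken by agent B after seeing the message
speaker_msgs  = [0, 1, 1, 0, 1, 0, 0, 1]
speaker_acts  = [0, 1, 1, 0, 1, 0, 0, 1]
listener_acts = [1, 0, 1, 1, 0, 1, 1, 0]

# Speaker Consistency: I(message_A ; action_A)
sc = mutual_information(speaker_msgs, speaker_acts)
# Instantaneous Coordination: I(message_A ; action_B)
ic = mutual_information(speaker_msgs, listener_acts)
print(f"SC = {sc:.3f} nats, IC = {ic:.3f} nats")
```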
This work agrees that it’s important to measure the causal impact of messages on other-agent actions, but argues that instantaneous coordination is flawed because the mutual information between messages and response actions doesn’t properly condition on the state of the game under which the message is being sent. Even if you successfully communicate your planned action to me, the action I actually take in response will be conditioned on my personal payoff matrix, and may average out to seeming unrelated or random if you take an expectation over every possible state the message could be received in. Instead, they suggest an explicitly causal approach: for each configuration of the game (each payoff matrix), sample different messages and check whether messages drive more consistent actions once you condition on the other factors in the game.
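The sketch below is one plausible way to operationalize that intervention-based measure, reusing the `mutual_information` helper from above: for each game configuration, sample messages, feed them to the listener, and measure the message/response dependence within that state before averaging. The function names (`speaker_msg_dist`, `listener_policy`) are assumptions for illustration, not the paper's API, and the paper's exact definition of the causal measure may differ.

```python
import numpy as np

def causal_influence_of_communication(states, speaker_msg_dist, listener_policy, n_samples=100):
    """Rough sketch: intervene on the message within each game state and
    measure how much the listener's response depends on it, then average."""
    cic_per_state = []
    for state in states:
        msgs, acts = [], []
        for _ in range(n_samples):
            m = speaker_msg_dist(state)    # sample a message given this state
            a = listener_policy(state, m)  # listener responds given state + message
            msgs.append(m)
            acts.append(a)
        # mutual information computed *per state*, so the game configuration
        # is held fixed rather than averaged over
        cic_per_state.append(mutual_information(msgs, acts))
    return float(np.mean(cic_per_state))
```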
An interesting finding of this paper is that, at least in these simple environments, you can find cases where there is Speaker Consistency (SC; messages that contain information about the speaker’s next action) but no substantial Causal Influence of Communication (CIC). This may seem counterintuitive: why would you, as an agent, send a message containing information about your action, if not because you’re incentivized to communicate with the other agent? It seems the answer is that this kind of shared information can arise *by accident*, as a result of the shared infrastructure between the action network and the messaging network. Because both use a shared set of early-layer representations, one ends up containing information about the other as an incidental fact; if the networks are fully separated, with no shared weights, the Speaker Consistency values drop.
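To make the "shared infrastructure" point concrete, here is a rough sketch of the two configurations in PyTorch: one agent whose action head and message head branch off a shared trunk, and one whose action and message networks are fully separate. This is an assumed architecture for illustration; the paper's exact networks may differ.

```python
import torch.nn as nn

class SharedTrunkAgent(nn.Module):
    """Action and message heads share early-layer representations, so the
    message can carry information about the action incidentally."""
    def __init__(self, obs_dim, n_actions, n_messages, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, n_actions)
        self.message_head = nn.Linear(hidden, n_messages)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.action_head(h), self.message_head(h)

class SeparateAgent(nn.Module):
    """Fully separate action and message networks, with no shared weights;
    this is the configuration in which Speaker Consistency reportedly drops."""
    def __init__(self, obs_dim, n_actions, n_messages, hidden=64):
        super().__init__()
        self.action_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_actions))
        self.message_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, n_messages))

    def forward(self, obs):
        return self.action_net(obs), self.message_net(obs)
```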
An important caveat: this paper isn’t, or at least shouldn’t be, arguing that agents in multi-agent systems don’t actually learn communication. The environments used here are quite simple, and may just not be difficult enough to incentivize communication. However, it is a fair point that it’s valuable to be precise about what exactly we’re measuring, and to test how that squares with what we actually care about in a system, so as to avoid cases like these where we may be led astray by our belief about how the system *should* be learning, rather than how it actually is.