Embodied Question Answering
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra
arXiv e-Print archive - 2017
Keywords:
cs.CV, cs.AI, cs.CL, cs.LG
First published: 2017/11/30
Abstract: We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where
an agent is spawned at a random location in a 3D environment and asked a
question ("What color is the car?"). In order to answer, the agent must first
intelligently navigate to explore the environment, gather information through
first-person (egocentric) vision, and then answer the question ("orange").
This challenging task requires a range of AI skills -- active perception,
language understanding, goal-driven navigation, commonsense reasoning, and
grounding of language into actions. In this work, we develop the environments,
end-to-end-trained reinforcement learning agents, and evaluation protocols for
EmbodiedQA.
This paper introduces a new AI task - Embodied Question Answering (EmbodiedQA). The agent must answer a question about its environment ("What color is the car?") while observing the environment only through a single egocentric RGB camera and navigating within it. The agent is composed of four modules:
https://i.imgur.com/6Mjidsk.png
1. **Vision**. 224x224 RGB images are processed by a CNN to produce a fixed-size representation. This CNN is pretrained on pixel-to-pixel tasks such as RGB reconstruction, semantic segmentation, and depth estimation.
2. **Language**. Questions are encoded with 2-layer LSTMs with 128-d hidden states. Separate question encoders are used for the navigation and answering modules to capture important words for each module.
3. **Navigation** is composed of a planner (forward, left, right, and stop actions) and a controller that executes the planner-selected action a variable number of times. The planner is an LSTM that takes its hidden state, the image representation, the question encoding, and the previous action. In contrast, the controller is an MLP with one hidden layer that takes the planner's hidden state, the planner's chosen action, and the image representation, and decides whether to execute the action again or pass control back to the planner (a rough sketch of this planner-controller loop follows the list).
4. **Answering**. The answering module computes an image-question similarity over the last 5 frames via a dot product between the image features (passed through an fc layer to align them with the question features) and the question encoding. The similarities are converted to attention weights via a softmax, and the attention-weighted image features are combined with the question features and passed through an answer classifier (a minimal sketch of this step is shown right after the list). Visually, this process is shown in the figure below. https://i.imgur.com/LeZlSZx.png
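To make the answering step concrete, here is a minimal PyTorch sketch of the attention over the last 5 frames. The feature dimensions, the number of answer classes, and the use of concatenation to combine the attended image features with the question encoding are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnsweringModule(nn.Module):
    """Sketch of the answering step: attend over the last 5 frames with a
    question-image dot-product similarity, then classify the answer.
    Dimensions and the fusion-by-concatenation choice are assumptions."""

    def __init__(self, img_feat_dim=3200, ques_dim=128, num_answers=50):
        super().__init__()
        # fc layer that aligns image features with the question-embedding space
        self.img_proj = nn.Linear(img_feat_dim, ques_dim)
        # answer classifier over the fused (attended image + question) vector
        self.classifier = nn.Linear(2 * ques_dim, num_answers)

    def forward(self, img_feats, ques_enc):
        # img_feats: (batch, 5, img_feat_dim) -- features of the last 5 frames
        # ques_enc:  (batch, ques_dim)        -- LSTM encoding of the question
        proj = self.img_proj(img_feats)                   # (batch, 5, ques_dim)
        scores = torch.bmm(proj, ques_enc.unsqueeze(2))   # (batch, 5, 1) dot products
        attn = F.softmax(scores, dim=1)                   # attention over the 5 frames
        attended = (attn * proj).sum(dim=1)               # (batch, ques_dim)
        fused = torch.cat([attended, ques_enc], dim=1)    # combine with the question
        return self.classifier(fused)                     # answer logits
```

Projecting the image features through a single fc layer keeps the dot-product similarity well-defined regardless of the CNN's output size.

The planner-controller split of the navigation module can be sketched in the same spirit. The action-embedding size, hidden sizes, and the sigmoid "repeat the same action" head of the controller below are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PlannerController(nn.Module):
    """Sketch of the navigation module: the planner (an LSTM) picks an action;
    the controller (a one-hidden-layer MLP) decides whether to repeat that
    action on the next frame or return control to the planner."""

    def __init__(self, img_dim=128, ques_dim=128, hidden_dim=128, num_actions=4):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, 32)        # forward/left/right/stop
        self.planner = nn.LSTMCell(img_dim + ques_dim + 32, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, num_actions)
        # controller input: planner hidden state + chosen action + current image feature
        self.controller = nn.Sequential(
            nn.Linear(hidden_dim + 32 + img_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def plan(self, img_feat, ques_enc, prev_action, state=None):
        # one planner step: choose the next action given image, question, previous action
        inp = torch.cat([img_feat, ques_enc, self.action_emb(prev_action)], dim=1)
        h, c = self.planner(inp, state)
        return self.action_head(h), (h, c)

    def control(self, h, action, img_feat):
        # probability of executing the same action again (vs. returning to the planner)
        inp = torch.cat([h, self.action_emb(action), img_feat], dim=1)
        return torch.sigmoid(self.controller(inp))
```

The point of this split is that the planner can reason at the level of "go forward until something changes" while the controller handles the low-level repetition of that action frame by frame.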
[Successful results](https://www.youtube.com/watch?v=gVj-TeIJfrk) as well as [failure cases](https://www.youtube.com/watch?v=4zH8cz2VlEg) are provided.
Overall, this is promising work that only scratches the surface of what is possible. Several constraints could be relaxed to push the field toward more general settings, for example by using environments with more realistic graphics and a broader set of questions and answers.