Learning by Asking Questions
Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten
arXiv e-Print archive - 2017
Keywords:
cs.CV, cs.CL, cs.LG
First published: 2017/12/04
Abstract: We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our LBA
generated data consistently matches or outperforms the CLEVR train data and is
more sample efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test time distributions.
This paper is about an interactive Visual Question Answering (VQA) setting in which an agent must ask questions about images in order to learn. This closely mimics how people learn from each other using natural language and has strong potential to learn much faster from less data. The setting is referred to as learning-by-asking (LBA) throughout the paper. The approach is composed of three modules:
http://imisra.github.io/projects/lba/approach_HQ.jpeg
1. The **question proposal module** is responsible for generating _important_ questions about the image. It is a combination of two models:
   - The **question generator** model produces a question. It is an LSTM that takes image features and a question type (chosen at random from the available types) as input and outputs a question (see the sketch after this list).
   - The **question relevance** model selects questions relevant to the image. It is a stacked attention network (shown below) that takes the generated question and the image features as input and filters out questions that are irrelevant to the image. https://i.imgur.com/awPcvYz.png
2. The **VQA module** learns to predict the answer given the image features and the question. It is implemented with the same stacked attention architecture shown above.
3. The **question selection module** selects the most informative question to ask. It takes the current state of the VQA module and its output and computes an expected accuracy improvement (details are in the paper), which measures how quickly the VQA module has the potential to improve for each answer. The single-question selection strategy (i.e., picking the question from which the VQA module is expected to improve fastest) is based on an epsilon-greedy policy (sketched below).
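To make the question generator in item 1 more concrete, here is a minimal sketch of what such an LSTM generator could look like: a greedy decoder conditioned on a single image feature vector and a question-type embedding. All names, dimensions, and the decoding scheme here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class QuestionGenerator(nn.Module):
    """Sketch of an LSTM question generator: image features and a sampled
    question type condition the decoder, which emits a question token by token."""

    def __init__(self, vocab_size, num_question_types,
                 img_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.type_embed = nn.Embedding(num_question_types, embed_dim)
        self.img_proj = nn.Linear(img_dim, hidden_dim)   # image features -> initial LSTM state
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, q_type, max_len=20):
        # Initialise the hidden state from the image features.
        h = torch.tanh(self.img_proj(img_feat))
        c = torch.zeros_like(h)
        # The question-type embedding is fed as the first decoder input.
        x = self.type_embed(q_type)
        tokens = []
        for _ in range(max_len):
            h, c = self.lstm(x, (h, c))
            next_tok = self.out(h).argmax(dim=-1)        # greedy decoding step
            tokens.append(next_tok)
            x = self.token_embed(next_tok)
        return torch.stack(tokens, dim=1)                # (batch, max_len) token ids

# Example usage with random inputs and a randomly chosen question type per image:
gen = QuestionGenerator(vocab_size=100, num_question_types=5)
img_feat = torch.randn(4, 512)
q_type = torch.randint(0, 5, (4,))
questions = gen(img_feat, q_type)
```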
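The selection step in item 3 reduces, at question-asking time, to an epsilon-greedy choice over candidate questions scored by their expected accuracy improvement. Below is a toy sketch assuming the improvement scores have already been computed for each candidate (the scoring itself is the part detailed in the paper):

```python
import random

def select_question(candidates, expected_improvements, epsilon=0.1):
    """Epsilon-greedy question selection: with probability epsilon ask a random
    candidate (exploration); otherwise ask the candidate with the highest
    expected accuracy improvement for the current VQA module (exploitation)."""
    if random.random() < epsilon:
        return random.choice(candidates)
    best = max(range(len(candidates)), key=lambda i: expected_improvements[i])
    return candidates[best]
```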
This method (i.e., LBA) is shown to be about 50% more data-efficient than standard VQA training. As an interesting future direction, the authors propose using real-world images and including a human in the loop as the answer provider.