This paper is about an interactive Visual Question Answering (VQA) setting in which agents must ask questions about images in order to learn. This closely mimics how people learn from each other using natural language and has strong potential to learn much faster from less data. It is referred to as learning by asking (LBA) throughout the paper.

The approach is composed of three modules:

http://imisra.github.io/projects/lba/approach_HQ.jpeg

1. **Question proposal module** is responsible for generating _important_ questions about the image. It is a combination of two models:
   - The **question generator** model produces a question. It is an LSTM that takes image features and a question type (chosen at random from the available options) as input and outputs a question.
   - The **question relevance** model selects questions relevant to the image. It is a stacked attention network (shown below) that takes the generated question and image features and filters out questions that are irrelevant to the image.

   https://i.imgur.com/awPcvYz.png

2. **VQA module** learns to predict the answer given the image features and a question. It is implemented with the same stacked attention architecture shown above.

3. **Question selection module** selects the most informative question to ask. It takes the current state of the VQA module and its output, and computes the expected accuracy improvement (details are in the paper) to measure how quickly the VQA module could improve for each answer. The selection of a single question (i.e. the best question for the VQA module to improve the fastest) follows an epsilon-greedy policy.

This method (i.e. LBA) is shown to be about 50% more data efficient than the naive VQA approach. As an interesting future direction for this work, the authors propose using real-world images and including a human in the training loop as an answer provider.
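To make the interplay of the three modules concrete, below is a minimal sketch of a single question-selection step. The callables `generator`, `relevance`, `vqa`, and `expected_improvement` are hypothetical stand-ins for the modules described above, not the paper's actual API; the structure of the loop follows the summary, not the released code.

```python
import random

def lba_select_question(image, question_types, generator, relevance, vqa,
                        expected_improvement, epsilon=0.1):
    """Sketch of one LBA question-selection step for a single image.

    All arguments except `image`, `question_types`, and `epsilon` are assumed
    to be callables standing in for the paper's modules.
    """
    # 1. Question proposal: generate one candidate question per question type,
    #    then keep only the candidates the relevance model judges to be
    #    relevant to the image.
    candidates = [generator(image, qtype) for qtype in question_types]
    candidates = [q for q in candidates if relevance(image, q)]
    if not candidates:
        return None

    # 2. Question selection: score each relevant candidate by how much the
    #    current VQA module is expected to improve if that question were
    #    answered (the paper's expected-accuracy-improvement measure).
    scores = [expected_improvement(vqa, image, q) for q in candidates]

    # 3. Epsilon-greedy policy: mostly exploit the highest-scoring question,
    #    occasionally explore a random relevant one.
    if random.random() < epsilon:
        return random.choice(candidates)
    return candidates[scores.index(max(scores))]
```

In the full training loop, the selected question would be sent to the oracle (the answer provider), and the resulting (image, question, answer) triple would be used to update the VQA module before the next round of asking.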