Visual Question Answering (VQA) models cannot count objects properly. In this paper, the authors trace the problem to the soft attention module, and they propose a component that produces reliable counts from object proposals.
There are two challenges in the VQA counting task:
(1) There is no ground truth label for the objects to be counted.
(2) The additional module should not affect performance on non-counting problems.
Why Soft Attention is not good for the counting task:
One case illustrates why Soft Attention limits counting ability:
Consider the task of counting cats in two images: an image of a single cat, and an image made of two side-by-side copies of the first.
For image 1: after softmax normalization in the attention module, the cat receives a normalized weight of 1.
For image 2: each cat receives a weight of 0.5.
The attention module then computes a weighted sum to produce an attention feature vector. Because the weighted sum averages the two cats in the second image back into a single cat, the attention feature vectors of the two images are identical. As a result, the information about possible counts is lost by using the attention map.
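A minimal sketch of this failure in NumPy, assuming both cat proposals share the same feature vector (all numbers here are illustrative, not from the paper):

```python
import numpy as np

# Feature vector of a "cat" proposal (illustrative values).
cat = np.array([0.2, 0.7, 0.1])

# Image 1: one proposal; softmax over a single logit gives weight 1.
weights_1 = np.array([1.0])
features_1 = np.stack([cat])

# Image 2: two identical proposals; softmax splits the weight 0.5/0.5.
weights_2 = np.array([0.5, 0.5])
features_2 = np.stack([cat, cat])

# Soft attention: weighted sum of proposal features.
attended_1 = weights_1 @ features_1
attended_2 = weights_2 @ features_2

print(np.allclose(attended_1, attended_2))  # True: one cat and two cats look the same
```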
Counting Component:
This component is in charge of counting the objects in an image. It has two things to do:
1) A differentiable mechanism for counting from attention weights.
2) Handling overlapping object proposals to reduce object double-counting.
The Counting Component is as follows:
https://i.imgur.com/xVGcaov.png
Note that intra-object edges connect proposals that refer to the same object (and the same class), while inter-object edges connect proposals that refer to different objects of the same class.
The figure has three main parts: (1) the object proposals (4 vertices), where the black vertices are relevant objects and the white ones are irrelevant; (2) intra-object edges between duplicate proposals; and (3) blue edges marking the inter-object duplicate edges. After deduplication, one edge and 2 vertices (the 2 relevant objects) remain.
To illustrate the component in more detail, here are the main steps:
(1) Input: The component takes $n$ attention weights $a = [a_{1}, a_{2},...,a_{n}]^{T}$ and their corresponding bounding boxes $b = [b_{1}, ..., b_{n}]^{T}$.
(2) Deduplication: The goal of this step is to build a graph from the attention matrix $A = aa^{T}$, where each vertex is a bounding-box proposal; ideally, $a_{i} = 1$ if the $i$-th bounding box is a relevant box and $a_{i} = 0$ otherwise.
The Counting Component then modifies this graph, deleting duplicate edges until the graph becomes a complete directed graph with self-loops.
For example, with $[a_{1}, a_{2}, a_{3}, a_{4}, a_{5}] = [1, 0, 1, 0, 1]$, the subgraph containing $a_{1}$, $a_{3}$, and $a_{5}$ is a complete directed graph, as follows:
https://i.imgur.com/cCKIQ0K.png
The illustration for this graph is as follows:
https://i.imgur.com/x93gk8c.png
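A quick sketch of this graph construction, using the example weights above (NumPy):

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # relevance weights of 5 proposals

# Attention matrix: the edge weight between proposals i and j is a_i * a_j.
A = np.outer(a, a)
print(A)
# Rows/columns 0, 2, and 4 form a complete directed graph with self-loops;
# every edge touching an irrelevant proposal (1 and 3) has weight 0.
```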
Then we will eliminate duplicate edges:
(1) intra-object edges and (2) inter-object edges.
1. Intra-object edges
First, we eliminate intra-object edges.
To achieve this, we compute the distance matrix $D$ where $D_{ij} = 1 - IoU(b_{i}, b_{j})$. If $D_{ij} \approx 0$, the two bounding boxes overlap heavily, so they are likely duplicate proposals of the same object and the edge between them should be eliminated.
To remove them, multiply the attention matrix $A$, calculated before, element-wise with the matrix $D$; since $D_{ij} \approx 0$ for heavily overlapping boxes, this removes the connections between duplicate proposals of a single object.
https://i.imgur.com/TQAvAnW.png
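A hedged sketch of this step, with a hand-written IoU helper (the paper additionally passes $A$ and $D$ through learned piecewise-linear functions, omitted here):

```python
import numpy as np

def iou(box_a, box_b):
    # Intersection over union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

boxes = np.array([[0.0, 0.0, 10.0, 10.0],     # object 1, proposal A
                  [0.0, 0.0, 10.0, 10.5],     # object 1, proposal B (near-duplicate)
                  [20.0, 20.0, 30.0, 30.0]])  # object 2
a = np.array([1.0, 1.0, 1.0])                 # all three proposals attended to

A = np.outer(a, a)
n = len(boxes)
D = np.array([[1.0 - iou(boxes[i], boxes[j]) for j in range(n)]
              for i in range(n)])

# Element-wise product: edges between heavily overlapping proposals go to ~0.
# Note the self-loops (D_ii = 0) are zeroed too; they are restored later.
A_intra = A * D
```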
2. Inter-object edges
Second, we eliminate inter-object edges.
The main idea is to merge the duplicate proposals of each object into one.
To do this, scale down the weights of the edges incident to each duplicated vertex.
For example, if an object has two proposals, the edges involving those proposals should be scaled by 0.5. Essentially, this averages the proposals within each underlying object, since only the sum of edge weights is used to compute the final count.
https://i.imgur.com/4An0BAj.png
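A simplified end-to-end sketch of the inter-object step and the final count. The number of duplicates per proposal is assumed known here (the paper estimates it from the similarity of attention weights and boxes), and the self-loop restoration is a simplified stand-in for the paper's learned version:

```python
import numpy as np

# Intra-deduplicated matrix from the previous sketch, idealized: proposals
# 0 and 1 are duplicates of one object, proposal 2 is a second object.
# The 0-1 edges and the self-loops were zeroed by the IoU step.
a = np.array([1.0, 1.0, 1.0])
A_intra = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0]])
dup_count = np.array([2.0, 2.0, 1.0])  # duplicates per proposal (assumed known)

# Each vertex is scaled by 1 / (number of duplicates), so the duplicates of
# one object share a total weight of 1; each edge is scaled by the weights
# of its two endpoints, which averages the duplicate proposals.
w = 1.0 / dup_count
C = A_intra * np.outer(w, w)

# Restore one unit of self-loop weight per underlying object (simplified).
C += np.diag(w * a**2)

# A complete directed graph with self-loops on c vertices has c^2 edges,
# so the count is the square root of the total edge weight.
count = np.sqrt(C.sum())
print(count)  # 2.0
```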