Summary by Abir Das
Concern about the issue of fairness (or the lack of it) in machine learning models is gaining widespread visibility among the general public, governments, and researchers. This is especially alarming as AI-enabled systems become ever more pervasive in our society, with decisions being taken by AI agents in domains ranging from healthcare to autonomous driving to criminal justice. Bias in any dataset is, in some way or other, a reflection of the general attitude of humankind towards different activities that are stereotypically associated with a certain gender, race, or ethnicity. As these datasets are the sources of knowledge for AI models (especially multimodal end-to-end models, which depend on human-annotated training data for literally everything), their decision making also inherits the bias of the dataset. This paper makes an important observation about image captioning models: they not only pick up the bias in the dataset but tend to exaggerate it during inference. This is a clear shortcoming of current supervised models, which are marked by their over-reliance on image context. The related work section of the paper (Section 2, first part: “Unwanted Dataset Bias”) gives an extensive review of the types of bias in datasets and of the few recent works trying to address them. Gender bias (a kitchen scene makes most of us guess that the person is a woman when the person is not clearly visible, or a man is presumed to snowboard more often than a woman) and reporting bias (over-reporting less common co-occurrences, such as “male nurse” or “green banana”) are two of the many biases present in machine learning datasets.
The paper addresses the problem of fair caption generation that does not presume a specific gender without appropriate visual evidence for that gender. This is done by introducing an ‘Equalizer’ model, which adds two complementary losses to the normal cross-entropy loss of image captioning systems. The Appearance Confusion Loss (ACL) encourages the model to generate gender-neutral words (for example, ‘person’) when an image does not contain enough evidence of gender. During training, the people in the images are masked out, and the loss term encourages the gendered words (“man” and “woman”) to have equal probability, i.e., the model is encouraged to be confused when it should be confused instead of hallucinating gender from the context. The loss expression is fairly intuitive (eqns. (2) and (3)). However, making the model confused is not enough on its own, so the other loss, the Confident Loss (Conf), is introduced. This loss encourages the model to predict gendered words, and to predict them correctly, when there is enough evidence of gender in the image. The loss function (eqns. (4) and (5)) makes clever use of the quotient between the predicted probabilities of male and female gendered words. If I had to give a single takeaway line from the paper, it would be the following, which summarizes the working principle behind the two losses very succinctly.
> “These complementary losses allow the Equalizer model to encourage models to be cautious in the absence of gender information and discriminative in its presence.”
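To make the two losses more concrete, here is a minimal PyTorch-style sketch of how they could be computed for a single caption. The placeholder word-id lists (`WOMAN_IDS`, `MAN_IDS`), the tensor shapes, and the per-caption reduction are my own assumptions for illustration, not the authors' implementation.

```python
import torch

# Hypothetical vocabulary indices for gendered words; the real model uses
# full word lists ("woman", "girl", ... / "man", "boy", ...).
WOMAN_IDS = torch.tensor([101, 102])
MAN_IDS = torch.tensor([201, 202])
EPS = 1e-6

def appearance_confusion_loss(probs_masked, target_ids):
    """probs_masked: (T, V) decoder word probabilities for the person-masked
    image I'. target_ids: (T,) ground-truth word ids. Penalizes any gap
    between the total woman-word and man-word probability at time steps
    whose ground-truth word is gendered (dataset-level averaging omitted)."""
    p_woman = probs_masked[:, WOMAN_IDS].sum(dim=1)
    p_man = probs_masked[:, MAN_IDS].sum(dim=1)
    is_gendered = (torch.isin(target_ids, WOMAN_IDS) |
                   torch.isin(target_ids, MAN_IDS)).float()
    confusion = (p_woman - p_man).abs()
    return (is_gendered * confusion).sum()

def confident_loss(probs, target_ids):
    """probs: (T, V) decoder word probabilities for the original image I.
    When the ground truth is a woman-word, penalize the quotient
    p(man words) / p(woman words); symmetrically for man-words."""
    p_woman = probs[:, WOMAN_IDS].sum(dim=1)
    p_man = probs[:, MAN_IDS].sum(dim=1)
    is_woman = torch.isin(target_ids, WOMAN_IDS).float()
    is_man = torch.isin(target_ids, MAN_IDS).float()
    return (is_woman * p_man / (p_woman + EPS) +
            is_man * p_woman / (p_man + EPS)).sum()
```

The quotient form means the Confident loss only vanishes when the probability mass assigned to the wrong gender is small relative to the correct one, which is exactly the "discriminative in its presence" behaviour the quote describes.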
The experiments are also well thought out. For the experiments, three different versions of the MSCOCO dataset are created: MSCOCO-Bias, MSCOCO-Confident, and MSCOCO-Balanced, in which the gender bias gradually decreases. Three different metrics are used to evaluate the models: error rate (the fraction of man/woman misclassifications), gender ratio (how close the gender ratio in the predicted captions of the test set is to the ground-truth gender ratio), and right for the right reasons (whether the visual evidence the model uses to predict gendered words coincides with the person regions of the image). There are a few baseline models and ablation studies. The baselines are a naive image captioning model (the ‘Show and Tell’ approach), an approach where images of the less common gender are sampled more often during training, and another baseline where the gendered words are given higher weights in the cross-entropy loss. The ablation models consider the two losses (ACL and Conf) separately. For all the datasets, the proposed Equalizer model consistently performs well according to all three metrics. The experiments also show that, as the evaluation datasets become more and more balanced (i.e., the gender distribution departs more and more from the biased gender distribution of the training dataset), the performance of all the models degrades. However, the proposed model performs the best, with the least variation in performance across the datasets. The qualitative examples with Grad-CAM and sliding-window saliency maps for the gendered words are also a positive point of the paper.
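As a rough illustration of the first two metrics, a small Python sketch is given below. The word lists, tokenization, and tie handling are my own simplifications; the paper's exact evaluation protocol may differ.

```python
from typing import List

WOMAN_WORDS = {"woman", "women", "girl"}   # assumed word lists
MAN_WORDS = {"man", "men", "boy"}

def predicted_gender(caption: str) -> str:
    """Map a caption to 'woman', 'man', or 'other' based on the words it uses."""
    tokens = set(caption.lower().split())
    has_w, has_m = bool(tokens & WOMAN_WORDS), bool(tokens & MAN_WORDS)
    if has_w and not has_m:
        return "woman"
    if has_m and not has_w:
        return "man"
    return "other"

def error_rate(pred_captions: List[str], gt_genders: List[str]) -> float:
    """Fraction of images whose caption uses a gendered word of the wrong gender."""
    wrong = sum(1 for c, g in zip(pred_captions, gt_genders)
                if predicted_gender(c) not in ("other", g))
    return wrong / len(pred_captions)

def gender_ratio(captions: List[str]) -> float:
    """Ratio of woman-mentions to man-mentions in a set of captions; compared
    against the same ratio computed on the ground-truth captions."""
    genders = [predicted_gender(c) for c in captions]
    return genders.count("woman") / max(genders.count("man"), 1)
```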
Things I would have liked the paper to contain:
* There are a few points of confusion in the expression of the Conf loss in eqn. (4). Specifically, I am not sure what the difference between $w_t$ and $\tilde{w}_t$ is. It seems the first is the ground-truth word and the latter is the predicted word. A clarification would have been welcome.
Overall, the paper is quite novel both in defining the problem and in solving it. The solution strategy is intuitive and easy to grasp, and the paper is well written too. One can sincerely hope that work addressing problems at the intersection of machine learning and societal issues will appear more frequently; the discussed paper is a very significant first step in that direction.