ShortScience.org - Making Science Accessible!

3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - 2019 via Local Bibsonomy
Keywords: 3D, Human, estimation, pose

Summary by Oleksandr Bailo 6 years ago

This paper proposes a method for 3D human pose estimation in video based on dilated temporal convolutions applied to 2D keypoints (the input to the network). The 2D keypoints can be obtained with any person keypoint detector; the paper uses Mask R-CNN with a ResNet-101 backbone, pre-trained on COCO and fine-tuned on 2D projections from Human3.6M.
https://i.imgur.com/CdQONiN.png

The poses are represented as 2D keypoint coordinates rather than heatmaps (i.e. a Gaussian placed at each keypoint's 2D location). This allows 1D convolutions over the time series instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses (the concatenated $(x,y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions.
https://i.imgur.com/tCZvt6M.png
Because zero-padding is not used in the convolutions, the output of each block is shorter than its input. The `Slice` layer in the residual connection therefore trims the residual symmetrically (on both the left and the right) so that its length matches the output of the main block. The input sequence itself is padded with replicas of the boundary frames on both sides.
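The interplay of unpadded dilated convolutions, residual slicing and boundary-frame padding can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the channel width, the dilations (3 and 9), the random weights and the helper names (`temporal_conv`, `residual_block`, `pad_with_boundary_frames`) are assumptions, and batch normalization and dropout are left out.

```python
import numpy as np

def temporal_conv(x, w, dilation=1):
    """Unpadded 1D convolution over time.

    x: (T, C_in) sequence, w: (K, C_in, C_out) kernel.
    Returns an array of shape (T - (K - 1) * dilation, C_out).
    """
    kernel_size = w.shape[0]
    t_out = x.shape[0] - (kernel_size - 1) * dilation
    out = np.zeros((t_out, w.shape[2]))
    for k in range(kernel_size):
        start = k * dilation
        out += x[start:start + t_out] @ w[k]
    return out

def residual_block(x, w_dilated, w_pointwise, dilation):
    """Dilated conv + pointwise conv, added to a symmetrically sliced residual."""
    y = np.maximum(temporal_conv(x, w_dilated, dilation), 0.0)  # ReLU
    y = np.maximum(temporal_conv(y, w_pointwise), 0.0)
    trim = (x.shape[0] - y.shape[0]) // 2  # frames lost on each side
    return x[trim:x.shape[0] - trim] + y

def pad_with_boundary_frames(x, pad):
    """Repeat the first and last frame `pad` times on each side."""
    left = np.repeat(x[:1], pad, axis=0)
    right = np.repeat(x[-1:], pad, axis=0)
    return np.concatenate([left, x, right], axis=0)

# Hypothetical sizes: 50 frames, 17 joints, 16 hidden channels.
rng = np.random.default_rng(0)
frames, joints, channels = 50, 17, 16
poses_2d = rng.normal(size=(frames, joints * 2))

w_in = rng.normal(scale=0.1, size=(3, joints * 2, channels))
blocks = [
    (rng.normal(scale=0.1, size=(3, channels, channels)),
     rng.normal(scale=0.1, size=(1, channels, channels)),
     dilation)
    for dilation in (3, 9)
]
w_out = rng.normal(scale=0.1, size=(1, channels, joints * 3))

# Kernel size 3 with dilations 1, 3, 9 gives a receptive field of 27 frames.
receptive_field = 27
h = pad_with_boundary_frames(poses_2d, (receptive_field - 1) // 2)
h = np.maximum(temporal_conv(h, w_in), 0.0)
for w_dilated, w_pointwise, dilation in blocks:
    h = residual_block(h, w_dilated, w_pointwise, dilation)
poses_3d = temporal_conv(h, w_out).reshape(frames, joints, 3)
```

With 13 replicated frames on each side, the network in this sketch produces one 3D pose per input frame even though every convolution shortens the sequence.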

3D pose estimation is a difficult task, in part because the available labeled data is limited. The authors therefore propose a semi-supervised approach that trains the 2D->3D pose estimator by exploiting unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, 3D keypoints are predicted from them, and these 3D points are reprojected back to 2D (which requires the camera intrinsic parameters). The idea is similar to cycle consistency in, for instance, [CycleGAN](https://junyanz.github.io/CycleGAN/).
https://i.imgur.com/CBHxFOd.png
In the semi-supervised part (bottom part of the image above), training penalizes reprojected 2D keypoints that are far from the original input. A weighted mean per-joint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to training than close ones), is used as the optimization goal.
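The reprojection penalty and the depth weighting can be sketched as follows. This is a rough illustration under simplifying assumptions: a plain pinhole camera without lens distortion, hypothetical intrinsics (`focal`, `center`), and hypothetical helper names; it is not the paper's exact loss code.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection of camera-space points (..., 3) to pixels (..., 2)."""
    xy = points_3d[..., :2] / points_3d[..., 2:3]
    return xy * focal + center

def reprojection_loss(pred_3d, detected_2d, focal, center):
    """Mean 2D distance between reprojected predictions and detected keypoints."""
    reprojected = project(pred_3d, focal, center)
    return np.mean(np.linalg.norm(reprojected - detected_2d, axis=-1))

def wmpjpe(pred_3d, target_3d):
    """Mean per-joint position error, weighted by the inverse target depth."""
    weights = 1.0 / target_3d[..., 2]
    return np.mean(weights * np.linalg.norm(pred_3d - target_3d, axis=-1))

# Hypothetical intrinsics and a tiny (frames, joints, 3) example.
focal, center = 1000.0, np.array([500.0, 500.0])
true_3d = np.array([[[0.0, 0.0, 4.0], [1.0, 2.0, 4.0]]])
detected_2d = project(true_3d, focal, center)

perfect = reprojection_loss(true_3d, detected_2d, focal, center)
shifted = reprojection_loss(true_3d + [0.1, 0.0, 0.0], detected_2d, focal, center)
```

A prediction that matches the true 3D pose reprojects exactly onto the detections, while a 0.1 unit sideways shift at depth 4 moves every reprojected joint by 25 pixels in this setup.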

The two networks (`supervised` above, `semi-supervised` below) have the same architecture but do not share any weights. They are optimized jointly, with the `semi-supervised` part serving as a regularizer. The two branches are coupled through a term that encourages their mean bone lengths to match.
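One way to express the bone-length coupling is sketched below. The 5-joint chain in `PARENTS`, the helper names and the use of a mean absolute difference are all assumptions made for illustration; the summary does not specify the exact form of the penalty.

```python
import numpy as np

# Hypothetical skeleton: a simple 5-joint chain, parent index per joint (-1 = root).
PARENTS = [-1, 0, 1, 2, 3]

def mean_bone_lengths(poses, parents=PARENTS):
    """poses: (T, J, 3). Returns the mean length of each bone over all frames."""
    child = [j for j, p in enumerate(parents) if p >= 0]
    parent = [parents[j] for j in child]
    lengths = np.linalg.norm(poses[:, child] - poses[:, parent], axis=-1)
    return lengths.mean(axis=0)

def bone_length_loss(pred_unlabeled, pred_labeled):
    """Penalize mismatched mean bone lengths between the two branches."""
    diff = mean_bone_lengths(pred_unlabeled) - mean_bone_lengths(pred_labeled)
    return np.mean(np.abs(diff))

# Joints spaced 1 unit apart along the x axis, repeated over 3 frames.
base = np.zeros((3, 5, 3))
base[:, :, 0] = np.arange(5)
loss_same = bone_length_loss(base, base)
loss_scaled = bone_length_loss(1.5 * base, base)
```

Because the term compares averages over a batch rather than individual frames, it acts as a soft constraint on skeleton scale and leaves the per-frame pose free.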

An interesting trend emerges from the MPJPE analysis with different amounts of supervised and unsupervised data: the `semi-supervised` approach becomes more effective as less labeled data is available.
https://i.imgur.com/bHpVcSi.png

Additionally, the error is reduced when ground-truth 2D keypoints are used as input. This means that a robust and accurate 2D keypoint detector is essential for accurate 3D pose estimation in this setting.
https://i.imgur.com/rhhTDfo.png

Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary
Fang, Meng and Cohn, Trevor
Association for Computational Linguistics - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Tim Miller 8 years ago

The authors obtain multilingual alignments from bilingual dictionaries, train a BiLSTM POS tagger on the source language, use it to automatically tag many tokens in the target language, and manually annotate 1000 tokens in the target language. They then train a system with a combined loss over the distant (automatic) tags and the gold tags, adding an extra output layer that is learned for the gold annotations.

MagNet: A Two-Pronged Defense against Adversarial Examples
Meng, Dongyu and Chen, Hao
ACM Conference on Computer and Communications Security - 2017 via Local Bibsonomy
Keywords: dblp

Summary by David Stutz 6 years ago

Meng and Chen propose MagNet, a combination of adversarial example detection and removal. At test time, given a clean or adversarial test image, the proposed defense works as follows. First, the input is passed through one or multiple detectors; if any of these detectors fires, the input is rejected. To this end, the authors consider detection based on the reconstruction error of an auto-encoder, or detection based on the divergence between probability predictions (on adversarial vs. clean examples). Second, if not rejected, the input is passed through a reformer. The reformer reconstructs the input, e.g., through an auto-encoder, to remove potentially undetected adversarial noise.
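The detect-then-reform pipeline can be sketched on toy data. Everything here is an illustrative assumption: the "auto-encoder" is a linear projection onto a known 2D subspace rather than a trained network, only the reconstruction-error detector is shown, the threshold is set at a hypothetical 99th percentile of clean errors, and `classifier` is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clean data lying close to a 2D subspace of a 10D input space.
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # orthonormal columns
clean = rng.normal(size=(500, 2)) @ basis.T + 0.01 * rng.normal(size=(500, 10))

def autoencode(x):
    """Stand-in auto-encoder: project onto the clean-data subspace."""
    return (x @ basis) @ basis.T

def reconstruction_error(x):
    return np.linalg.norm(x - autoencode(x), axis=-1)

# Detector threshold chosen from clean data only (hypothetical choice).
threshold = np.quantile(reconstruction_error(clean), 0.99)

def classifier(x):
    """Placeholder classifier: sign of the first latent coordinate."""
    return int(x @ basis[:, 0] > 0)

def magnet_defense(x):
    """Reject if the detector fires, otherwise classify the reformed input."""
    if reconstruction_error(x) > threshold:
        return None  # input rejected as adversarial
    return classifier(autoencode(x))

# A point on the data manifold, and the same point pushed off the manifold.
x_clean = basis @ np.array([1.0, -1.0])
direction = rng.normal(size=10)
direction -= autoencode(direction)
direction /= np.linalg.norm(direction)
x_adv = x_clean + 1.0 * direction
```

In this sketch `magnet_defense(x_clean)` returns a label, while `magnet_defense(x_adv)` returns `None` because the off-manifold perturbation produces a large reconstruction error.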

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Exploiting local features from deep networks for image retrieval
Ng, Joe Yue-Hei and Yang, Fan and Davis, Larry S.
Conference on Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Vivek Gandhi 9 years ago

In this paper, the authors raise an important point for instance-based image retrieval. For a task like image recognition, features extracted from the higher layers of deep networks generally work well, but for instance-based image retrieval, features from the higher layers prove less useful. The authors therefore suggest taking features from lower layers and applying [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf) to them. As post-processing on top of the VLAD encoding, they apply steps such as intra-normalisation, followed by PCA to reduce the encoding to 128 dimensions. The experiments use [GoogLeNet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf): Inception 3a, Inception 4a and Inception 4e on GoogLeNet, and conv4_2, conv5_1 and conv5_2 on VGG-16. These layers perform almost equally well on the dataset used. The performance metric is mean average precision (mAP).
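The encoding pipeline described above (VLAD, intra-normalisation, PCA) can be sketched in NumPy. This is a minimal sketch under stated assumptions: random vectors stand in for convolutional features, the codebook `centers` is random rather than learned with k-means, and the sizes (8 centers, 32-dimensional descriptors, 200 images) are hypothetical.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """descriptors: (N, D) local features, centers: (K, D) codebook.

    Returns an intra-normalized, L2-normalized VLAD vector of length K * D.
    """
    dist2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assignment = dist2.argmin(axis=1)  # nearest center per descriptor
    vlad = np.zeros_like(centers, dtype=float)
    for k in range(centers.shape[0]):
        members = descriptors[assignment == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)  # residual sum
    # Intra-normalisation: L2-normalize each center's residual block.
    block_norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = (vlad / np.maximum(block_norms, 1e-12)).ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)

def pca_reduce(vectors, dim):
    """Project row vectors onto their top `dim` principal components."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 32))  # stand-in for a k-means codebook
images = [rng.normal(size=(50, 32)) for _ in range(200)]
encodings = np.stack([vlad_encode(f, centers) for f in images])  # (200, 256)
compact = pca_reduce(encodings, 128)  # (200, 128)
```

In a real pipeline, each row of `descriptors` would be the feature vector at one spatial location of the chosen convolutional layer.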

Understanding Black-box Predictions via Influence Functions
Koh, Pang Wei and Liang, Percy
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

Summary by kangcheng 6 years ago

**Goal**: identifying training points most responsible for a given prediction.

Given training points $z_1, \dots, z_n$, let the training loss (empirical risk) be $\frac{1}{n}\sum_{i=1}^nL(z_i, \theta)$, with minimizer $\hat{\theta}$.

The influence function lets us compute the change in the parameters if $z$ were upweighted by some small $\epsilon$.
$$\hat{\theta}_{\epsilon, z} := \arg \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$$

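$$\mathcal{I}_{\text{up, params}}(z) := \frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

Here $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 L(z_i, \hat{\theta})$ is the Hessian of the training loss at $\hat{\theta}$. $\mathcal{I}_{\text{up, params}}(z)$ shows how upweighting a single point $z$ affects the estimate of the parameters $\theta$.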

Furthermore, the chain rule gives the effect of upweighting $z$ on the loss at a test point.
$$\mathcal{I}_{\text{up, loss}}(z, z_{\text{test}}) = \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \mathcal{I}_{\text{up, params}}(z)$$ 
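As a sanity check of these formulas, the sketch below evaluates them for ordinary least squares, where the Hessian is available in closed form, and compares the prediction against actual leave-one-out retraining. The synthetic data, the squared loss $L(z, \theta) = \frac{1}{2}(x^\top\theta - y)^2$ and the function names are assumptions chosen for illustration. Removing a training point corresponds to upweighting it by $\epsilon = -\frac{1}{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def fit(X, y):
    """Minimizer of the mean squared loss (ordinary least squares)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def grad_loss(x, y_, theta):
    """Gradient of L(z, theta) = 0.5 * (x . theta - y)^2 w.r.t. theta."""
    return x * (x @ theta - y_)

theta_hat = fit(X, y)
H = X.T @ X / n  # Hessian of the empirical risk

def influence_up_params(x, y_):
    """I_up,params(z) = -H^{-1} grad L(z, theta_hat)."""
    return -np.linalg.solve(H, grad_loss(x, y_, theta_hat))

def influence_up_loss(x, y_, x_test, y_test):
    """I_up,loss(z, z_test) = grad L(z_test)^T I_up,params(z)."""
    return grad_loss(x_test, y_test, theta_hat) @ influence_up_params(x, y_)

# Removing point i means epsilon = -1/n, so the predicted parameter change
# is -(1/n) * I_up,params(z_i). Compare with retraining without point i.
i = 0
predicted_change = -influence_up_params(X[i], y[i]) / n
actual_change = fit(np.delete(X, i, axis=0), np.delete(y, i)) - theta_hat
```

For least squares the two vectors are parallel and differ only by a factor of $1 - h_i$, where $h_i$ is the leverage of point $i$, so the first-order approximation is tight whenever no single point dominates the fit.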

Besides upweighting a training point, the change in the parameters caused by perturbing a training point can also be estimated. Let $z_\delta$ denote the perturbed version of $z$; moving $\epsilon$ mass from $z$ onto $z_\delta$ gives
$$\frac{d\hat{\theta}_{\epsilon, z_\delta, -z}}{d\epsilon}\Big|_{\epsilon=0} = \mathcal{I}_{\text{up, params}}(z_\delta) - \mathcal{I}_{\text{up, params}}(z)$$
This measures how a perturbation $\delta$ of the training point $z$ affects the parameter estimate $\hat{\theta}$.
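
As a worked step, assume $z = (x, y)$ and $z_\delta = (x + \delta, y)$, with $L$ differentiable in both $x$ and $\theta$ and $\delta$ small. Substituting the definition of $\mathcal{I}_{\text{up, params}}$ and expanding to first order in $\delta$ turns the difference of influences into a single mixed-derivative term:

$$\mathcal{I}_{\text{up, params}}(z_\delta) - \mathcal{I}_{\text{up, params}}(z) = -H_{\hat{\theta}}^{-1}\left(\nabla_\theta L(z_\delta, \hat{\theta}) - \nabla_\theta L(z, \hat{\theta})\right) \approx -H_{\hat{\theta}}^{-1}\left[\nabla_x \nabla_\theta L(z, \hat{\theta})\right]\delta$$

Replacing $z$ by $z_\delta$ in the training set corresponds to $\epsilon = \frac{1}{n}$, so the parameters change by roughly $\frac{1}{n}$ times this quantity.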

Section 3 of the paper describes practical techniques for computing these quantities efficiently.

This set of tools can be used for a number of interpretable machine learning tasks.