[link]
This paper proposes a 3D human pose estimation in video method based on the dilated temporal convolutions applied on 2D keypoints (input to the network). 2D keypoints can be obtained using any person keypoint detector, but Mask RCNN with ResNet101 backbone, pretrained on COCO and finetuned on 2D projections from Human3.6M, is used in the paper. https://i.imgur.com/CdQONiN.png The poses are presented as 2D keypoint coordinates in contrast to using heatmaps (i.e. Gaussian operation applied at the keypoint 2D location). Thus, 1D convolutions over the time series are applied, instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses ( concatenated $(x,y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions. https://i.imgur.com/tCZvt6M.png The `Slice` layer in the residual connection performs padding (or slicing) the sequence with replicas of boundary frames (to both left and right) to match the dimensions with the main block as zeropadding is not used in the convolution operations. 3D pose estimation is a difficult task particularly due to the limited data available online. Therefore, the authors propose semisupervised approach of training the 2D>3D pose estimation by exploiting unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, then 3D keypoints are predicted from them and these 3D points are reprojected back to 2D (camera intrinsic parameters are required). This is idea similar to cycle consistency in the [CycleGAN](https://junyanz.github.io/CycleGAN/), for instance. https://i.imgur.com/CBHxFOd.png In the semisupervised part (bottom part of the image above) training penalizes when the reprojected 2D keypoints are far from the original input. Weighted mean perjoint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to the training than close ones) is used as the optimization goal. The two networks (`supervised` above, `semisupervised` below) have the same architecture but do not share any weights. They are jointly optimized where `semisupervised` part serves as a regularizer. They communicate through the path aiming to make sure that the mean bone length of the above and below branches match. The interesting tendency is observed from the MPJPE analysis with different amounts of supervised and unsupervised data available. Basically, the `semisupervised` approach becomes more effective when less labeled data is available. https://i.imgur.com/bHpVcSi.png Additionally, the error is reduced when the ground truth keypoints are used. This means that a robust and accurate 2D keypoint detector is essential for the accurate 3D pose estimation in this setting. https://i.imgur.com/rhhTDfo.png 
[link]
This paper is a topdown (i.e. requires person detection separately) pose estimation method with a focus on improving highresolution representations (features) to make keypoint detection easier. During the training stage, this method utilizes annotated bounding boxes of person class to extract ground truth images and keypoints. The data augmentations include random rotation, random scale, flipping, and [half body augmentations](http://presentations.cocodataset.org/ECCV18/COCO18KeypointsMegvii.pdf) (feeding upper or lower part of the body separately). Heatmap learning is performed in a typical for this task approach of applying L2 loss between predicted keypoint locations and ground truth locations (generated by applying 2D Gaussian with std = 1). During the inference stage, pretrained object detector is used to provide bounding boxes. The final heatmap is obtained by averaging heatmaps obtained from the original and flipped images. The pixel location of the keypoint is determined by $argmax$ heatmap value with a quarter offset in the direction to the secondhighest heatmap value. While the pipeline described in this paper is a common practice for pose estimation methods, this method can achieve better results by proposing a network design to extract better representations. This is done through having several parallel subnetworks of different resolutions (next one is half the size of the previous one) while repeatedly fusing branches between each other: https://raw.githubusercontent.com/leoxiaobin/deephighresolutionnet.pytorch/master/figures/hrnet.png The fusion process varies depending on the scale of the subnetwork and its location in relation to others: https://i.imgur.com/mGDn7pT.png 
[link]
# Overview This paper presents a novel way to align frames in videos of similar actions temporally in a selfsupervised setting. To do so, they leverage the concept of cycleconsistency. They introduce two formulations of cycleconsistency which are differentiable and solvable using standard gradient descent approaches. They name their method Temporal Cycle Consistency (TCC). They introduce a dataset that they use to evaluate their approach and show that their learned embeddings allow for few shot classification of actions from videos. Figure 1 shows the basic idea of what the paper aims to achieve. Given two video sequences, they wish to map the frames that are closest to each other in both sequences. The beauty here is that this "closeness" measure is defined by nearest neighbors in an embedding space, so the network has to figure out for itself what being close to another frame means. The cycleconsistency is what makes the network converge towards meaningful "closeness". ![image](https://userimages.githubusercontent.com/18450628/6388819068b8a500c9ac11e99da7925b72c731c3.png) # Cycle Consistency Intuitively, the concept of cycleconsistency can be thought of like this: suppose you have an application that allows you to take the photo of a user X and increase their apparent age via some transformation F and decrease their age via some transformation G. The process is cycleconsistent if you can age your image, then using the aged image, "deage" it and obtain something close to the original image you took; i.e. F(G(X)) ~= X. In this paper, cycleconsistency is defined in the context of nearestneighbors in the embedding space. Suppose you have two video sequences, U and V. Take a frame embedding from U, u_i, and find its nearest neighbor in V's embedding space, v_j. Now take the frame embedding v_j and find its closest frame embedding in U, u_k, using a nearestneighbors approach. If k=i, the frames are cycle consistent. Otherwise, they are not. The authors seek to maximize cycle consistency across frames. # Differentiable Cycle Consistency The authors present two differentiable methods for cycleconsistency; cycleback classification and cycleback regression. In order to make their cycleconsistency formulation differentiable, the authors use the concept of soft nearest neighbor: ![image](https://userimages.githubusercontent.com/18450628/638910615477a680c9b211e99e4f55e11d81787d.png) ## cycleback classification Once the soft nearest neighbor v~ is computed, the euclidean distance between v~ and all u_k is computed for a total of N frames (assume N frames in U) in a logit vector x_k and softmaxed to a prediction ลท = softmax(x): ![image](https://userimages.githubusercontent.com/18450628/6389145038c0d000c9b311e989e9d257be3fd175.png). ![image](https://userimages.githubusercontent.com/18450628/63891746e92ed400c9b311e9982c078ebd1d747e.png) Note the clever use of the negative, which will ensure the softmax selects for the highest distance. The ground truth label is a onehot vector of size 1xN where position i is set to 1 and all others are set to 0. Crossentropy is then used to compute the loss. ## cycleback regression The concept is very similar to cycleback classification up to the soft nearest neighbor calculation. However the similarity metric of v~ back to u_k is not computed using euclidean distance but instead soft nearest neighbor again: ![image](https://userimages.githubusercontent.com/18450628/638932319bb46600c9b711e991450c13e8ede5e6.png) The idea is that they want to penalize the network less for "close enough" guesses. This is done by imposing a gaussian prior on beta centered around the actual closest neighbor i. ![image](https://userimages.githubusercontent.com/18450628/6389343929905100c9b811e981fefab238021c6d.png) The following figure summarizes the pipeline: ![image](https://userimages.githubusercontent.com/18450628/638962789f4beb00c9bf11e98be15f1ad67199c7.png) # Datasets All training is done in a selfsupervised fashion. To evaluate their methods, they annotate the Pouring dataset, which they introduce and the Penn Action dataset. To annotate the datasets, they limit labels to specific events and phases between phases ![image](https://userimages.githubusercontent.com/18450628/63894846affa6200c9bb11e989192f2cdf720a88.png) # Model The network consists of two parts: an encoder network and an embedder network. ## Encoder They experiment with two encoders: * ResNet50 pretrained on imagenet. They use conv_4 layer's output which is 14x14x1024 as the encoder scheme. * A Vgglike model from scratch whose encoder output is 14x14x512. ## Embedder They then fuse the k following frame encodings along the time dimension, and feed it to an embedder network which uses 3D Convolutions and 3D pooling to reduce it to a 1x128 feature vector. They find that k=2 works pretty well.
1 Comments
