Keypoint detection is an important step in many tasks such as SLAM, panorama stitching, and camera calibration. Efficient keypoint detectors, for example FAST (Features from Accelerated Segment Test), detect keypoints where the brightness changes sharply relative to the surrounding pixels. Most likely, the keypoints will be located on edges, as shown below:

https://i.imgur.com/ylC4BM3.jpg

Let's consider another image shown below. Here, although the detector finds many keypoints, they are mostly located on the trees (see subfigure (a) below). This causes redundancy, and the paper addresses it by selecting locally strong keypoints that are well distributed over the whole image (subfigure (c)).

https://i.imgur.com/1MqZhmT.png

The algorithm requires the input keypoints to be sorted in decreasing order of strength. The keypoints are then processed in that order, and points that fall within the suppression range of the current keypoint are removed from consideration. The process is repeated for the next unsuppressed keypoint in the sorted order and continues until no points remain. If the number of surviving points is not close enough to the required number, the suppression range is modified and the process is repeated. The suppression range is modified using binary search.

https://i.imgur.com/qryscZP.png

Binary search requires lower and upper bounds to operate. A naive initialization would set the lower bound to 1 pixel and the upper bound to the image width or height. Instead, this paper proposes a better initialization of the suppression range that depends on the image height $H_I$, the image width $W_I$, the number of input keypoints $n$, and the number of output keypoints $m$.
* Upper bound: $\frac{H_I + W_I + 2m - \sqrt{\Delta}}{2m - 1}$ (see paper for more details)
* Lower bound: $\frac{1}{2}\sqrt{\frac{n}{m}}$

This initialization reduces the number of iterations to convergence by a factor of three:

https://i.imgur.com/7NCpgbi.png

A homogeneous spatial distribution of keypoints is beneficial for SLAM and reduces translational and rotational errors compared to naive filtering when evaluated on KITTI:

https://i.imgur.com/4wq0kLK.png

Overall, this paper proposes a fast, efficient, and effective method to post-process noisy and redundant keypoint detections with a clear benefit to SLAM. The code is publicly available in multiple languages: C++, Python, Java, and Matlab. See https://github.com/BAILOOL/ANMS-Codes |
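The suppression loop and the binary search over the suppression range described in the keypoint-filtering summary above can be sketched as follows. This is a minimal brute-force illustration, not the authors' implementation (which is in the linked repository): the function names, the `tolerance` and `max_iter` parameters, and the quadratic-time distance check are all assumptions made for readability.

```python
import math

def suppress(keypoints, radius):
    """Greedy pass over keypoints that are assumed sorted by decreasing strength.

    A keypoint is kept only if no previously kept (stronger) keypoint
    lies within `radius` of it.
    """
    kept = []
    for x, y in keypoints:
        if all(math.hypot(x - kx, y - ky) > radius for kx, ky in kept):
            kept.append((x, y))
    return kept

def select_keypoints(keypoints, m, low, high, tolerance=0.1, max_iter=50):
    """Binary-search the suppression radius until roughly `m` keypoints survive.

    `low` and `high` are the initial bounds of the search; `tolerance` is the
    accepted relative deviation from `m` (a hypothetical parameter).
    """
    best = suppress(keypoints, low)
    for _ in range(max_iter):
        if high - low < 1e-6:
            break
        radius = (low + high) / 2.0
        kept = suppress(keypoints, radius)
        if abs(len(kept) - m) <= abs(len(best) - m):
            best = kept  # remember the result closest to the target so far
        if abs(len(kept) - m) <= tolerance * m:
            break
        if len(kept) > m:
            low = radius   # too many survivors: suppress more aggressively
        else:
            high = radius  # too few survivors: shrink the radius
    return best
```

For example, calling `select_keypoints(points, m=25, low=1.0, high=10.0)` on a 10x10 grid of points uses the naive bounds mentioned above (1 pixel and the image size); the tighter bounds from the paper would only change the `low` and `high` arguments.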
This paper proposes an approach to measure motion similarity between human-human and human-object interactions. The authors claim that human activities are usually defined by the interaction between individual characters, such as a high-five. Since interaction datasets are not available, the authors provide multiple small-scale interaction datasets:

https://i.imgur.com/P815TYu.png

where:
- 2C = a Character-Character (2C) database using kick-boxing motions
- CRC = Character-Retargeted Character, where the size of characters is adjusted while maintaining the nature of the interaction
  https://i.imgur.com/XX1WNpO.png
- HOI = Human-Object Interaction, where a chair is used as the object
  https://i.imgur.com/Z6cxd7R.png
- 2PB = 2 People Boxing
  https://i.imgur.com/yUxmpY5.png
- 2PD = 2 People Daily Interaction, where people are represented as a surface point cloud
  https://i.imgur.com/EzyELg3.png

**Methodology**

- *Customized Interaction Mesh Structure*. An interaction mesh is created by generating a volumetric mesh using Delaunay Tetrahedralization. An interaction is therefore represented by a series of interaction meshes. To reduce the bias caused by an unequal number of points per human body part, synthetic points (shown in blue) are derived from the available skeleton structure (shown in red):
  https://i.imgur.com/jtlrH49.png
  The edges after Delaunay Tetrahedralization are filtered so that all edges connecting points of the same character are removed, as they do not contribute to the interaction.
  https://i.imgur.com/nWeUNl1.png
  The temporal sequence of interaction is a series of interaction meshes.
- *Distance between interaction meshes*.
  - Distance between two interaction meshes of two-character interactions: $d(e_i, e_j) = (|e_{i1} - e_{j1}| + |e_{i2} - e_{j2}|) \times \frac{1}{2} (1 - \cos\theta),$
    https://i.imgur.com/jMHGx3o.png
    where $e_{i1}$ and $e_{i2}$ are the two endpoints of an edge.
- *Earth Mover's Distance*.
Earth Mover's Distance (EMD) is used to find the best correspondence between the input interaction meshes, which makes it possible to compare two interactions with different semantic meaning: $EMD(E_I^{t_I}, E_J^{t_J}) = \frac{D(E_I^{t_I}, E_J^{t_J})}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}^*},$ where $D(E_I^{t_I}, E_J^{t_J}) = \sum_{i=1}^{m} \sum_{j=1}^{n} d(e_i, e_j) f_{i,j}^*$ represents the minimum distance between two interaction meshes and $f_{i,j}^*$ is the optimal set of flow values returned by the mass transport solver, which finds the optimal edge-level correspondence between two interaction meshes. The concept of the mass transport solver is visualized below:
  https://i.imgur.com/wmRhcxP.png
- *Distance between interaction sequences*.
  - spatial normalization - removing the pelvis translation and the horizontal facing angle in each frame
  - temporal sampling - a non-linear sampling strategy based on the frame distance measured by EMD. The sampling algorithm takes fewer samples in temporal regions with high similarity, which contribute less to the context of the interaction.
  - temporal alignment - keyframes are aligned using Dynamic Time Warping (DTW)

Possible functionality with the proposed method:
- Interaction motion similarity analysis
- Interaction motion retrieval

Notice:
>The pre-process took 1.5 hours, 0.5 hour and 4 hours for the 2C, CRC and HOI databases respectively. Given the meshes, computing the distance between two interactions took 0.2 seconds on average.

Discussion:
- the algorithm focuses on boxing/kickboxing as they have clear rules. Extending the proposed algorithm to measure motion similarity for daily activities would require careful annotation.
- While the method can, in theory, be applied to single-human activities, it is quite clear that it would perform worse than other baselines for the task.
This implies that the method should not be used where two independent movements with no interaction between them (dancing, yoga) need to be compared.
- The method works best for close-interaction activities, as the edges in the interaction mesh would tend to have a similar structure (e.g. edge length) in the case of distantly interacting objects.
- Application-wise, the algorithm is not suitable for online real-time use. |
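The temporal alignment step in the interaction-similarity summary above relies on Dynamic Time Warping. A minimal sketch of textbook DTW is given below; it is not the authors' code. The `frame_distance` callback stands in for the EMD between two interaction meshes, and in the usage line a plain absolute difference is substituted for it purely as an assumption to keep the example self-contained.

```python
def dtw_distance(seq_a, seq_b, frame_distance):
    """Accumulated cost of the best monotonic alignment of two keyframe sequences.

    `frame_distance(a, b)` returns the distance between two keyframes; in the
    summarized method this role is played by EMD between interaction meshes.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

With scalar "frames", `dtw_distance([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b))` aligns the repeated `2` with the single `2` in the first sequence, so sequences that differ only in timing end up with zero accumulated cost.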
This paper presents a method to extract motion (dynamic) and skeleton / camera-view (static) representations from a video of a person represented as a 2D joint skeleton. This decomposition allows transferring the motion to different skeletons (retargeting), among other applications. It does so by utilizing deep neural networks.

https://i.imgur.com/J5jBzcs.png

The architecture consists of motion and skeleton / camera-view encoders that decompose an input sequence of 2D joint positions into latent spaces, and a decoder that reconstructs a sequence from such components. The motion representation varies in length, while the skeleton and camera-view representations have a fixed size.

https://i.imgur.com/QaDksg1.png

This follows from the network design. Specifically, the motion encoder uses strided 1D convolutions, so its output length is proportional to the input length. The static encoder, on the other hand, uses global average pooling in the final layer to produce a fixed-size latent representation:

https://i.imgur.com/Cf7TVKA.png

A more detailed design of the encoders and decoder is shown below:

https://i.imgur.com/cpaveFm.png

**Dataset**. Adobe Mixamo is used to obtain sequences of poses of different 3D characters. It allows creating multiple samples where different characters (with different skeleton structures) perform the same motions. These 3D video clips are then projected into 2D by selecting arbitrary view angles and distances to the object. Thus, we can easily create multiple pairs of 2D image sequences of characters (same or different) performing various actions (same or different) from various views.

**Loss functions** used for training (refer to the paper for the detailed formulas):
- *Cross Reconstruction Loss* It is a sum of two other losses. The first one is the reconstruction loss, where the network tries to reconstruct the original input.
The second one is the cross reconstruction loss, where the network tries to reconstruct the sequence in which a different character performs the exact same action as the input. It is best shown in the figure below:
  https://i.imgur.com/ewZOAox.png
- *Triplet Loss* This loss aims to bring the latent representations of similar motions closer together while pushing apart those of different motions. It takes two triplets, each containing two samples that share the same (or very similar) motion and one with a different motion. The same concept is applied to the static latent space.
- *Foot velocity loss* This loss helps to remove the foot skating phenomenon, as hands and feet exhibit larger errors than the other keypoints.
  https://i.imgur.com/DclJEde.png
  where $V_{global}$ and $V_{joint_n}$ extract the global and local ($n$th joint) velocities from the reconstructed output $\hat{p}_{ij}$, respectively, and map them back to the image units, and $V_{orig_n}$ returns the original global velocity of the $n$th joint from the ground truth, $p_{ij}$.

**Normalization**
- subtract the root position from all joint locations in every frame
- subtract the mean joint position and divide by the standard deviation (averaged over the entire dataset)
- per-frame global velocity is not touched

**Data Augmentation** applied during training:
- temporal clipping during the batch creation process
- scaling - equivalent to using a different camera distance to the object
- flipping symmetrical joints
- dropping joints to simulate the behavior of a real keypoint detector, which often misses some joints
- adding real video data to the training and using a reprojection loss in case no labels are given

**Results and Evaluation** (to be continued) ...

While this summary has become too long to be called a summary, it is worth mentioning that several applications are possible with this approach:
- performance cloning - make any 2D skeleton repeat particular motions
- motion retrieval - search videos that contain the particular target motion |
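To make the triplet loss from the motion-retargeting summary above more concrete, here is a sketch of the standard margin-based form. The exact formula used in the paper may differ (the summary defers to the paper for details), and the `margin` value and the plain Euclidean distance are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss on latent codes.

    `anchor` and `positive` share the same motion, `negative` has a different
    one. The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. `margin=0.3` is a hypothetical value.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The same function could be applied to the static (skeleton / camera-view) codes by swapping which samples count as positive and negative.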
This paper proposes a method for 3D human pose estimation in video based on dilated temporal convolutions applied to 2D keypoints (the input to the network). 2D keypoints can be obtained using any person keypoint detector, but Mask R-CNN with a ResNet-101 backbone, pre-trained on COCO and fine-tuned on 2D projections from Human3.6M, is used in the paper.

https://i.imgur.com/CdQONiN.png

The poses are represented as 2D keypoint coordinates rather than heatmaps (i.e. a Gaussian applied at the 2D keypoint location). Thus, 1D convolutions over the time series are applied instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses (concatenated $(x,y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions.

https://i.imgur.com/tCZvt6M.png

The `Slice` layer in the residual connection pads (or slices) the sequence with replicas of boundary frames (on both the left and the right) so that its dimensions match those of the main block, since zero-padding is not used in the convolution operations.

3D pose estimation is a difficult task, particularly due to the limited data available online. Therefore, the authors propose a semi-supervised approach to training the 2D->3D pose estimation that exploits unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, 3D keypoints are then predicted from them, and these 3D points are reprojected back to 2D (camera intrinsic parameters are required). The idea is similar to cycle consistency in [CycleGAN](https://junyanz.github.io/CycleGAN/), for instance.

https://i.imgur.com/CBHxFOd.png

In the semi-supervised part (bottom part of the image above), training penalizes reprojected 2D keypoints that are far from the original input.
A weighted mean per-joint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to the training than close ones), is used as the optimization goal. The two networks (`supervised` above, `semi-supervised` below) have the same architecture but do not share any weights. They are jointly optimized, with the `semi-supervised` part serving as a regularizer. They communicate through a path that ensures the mean bone lengths of the upper and lower branches match.

An interesting tendency is observed in the MPJPE analysis with different amounts of supervised and unsupervised data available. Basically, the `semi-supervised` approach becomes more effective when less labeled data is available.

https://i.imgur.com/bHpVcSi.png

Additionally, the error is reduced when ground-truth keypoints are used. This means that a robust and accurate 2D keypoint detector is essential for accurate 3D pose estimation in this setting.

https://i.imgur.com/rhhTDfo.png |
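The MPJPE metric and its depth-weighted variant from the 3D pose summary above can be sketched in a few lines. This is an illustration under assumptions: poses are lists of 3D points, and `depth` is a single scalar distance to the person supplied by the caller; how exactly the depth is obtained in the paper is not specified in the summary.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def wmpjpe(pred, target, depth):
    """MPJPE weighted by the inverse of the depth to the subject.

    Far-away subjects (large `depth`) contribute less to the loss than
    close ones. `depth` must be positive.
    """
    return mpjpe(pred, target) / depth
```

`math.dist` requires Python 3.8 or newer.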
This paper presents a top-down (i.e. it requires separate person detection) pose estimation method with a focus on improving high-resolution representations (features) to make keypoint detection easier.

During the training stage, this method uses annotated bounding boxes of the person class to extract ground-truth images and keypoints. The data augmentations include random rotation, random scale, flipping, and [half body augmentations](http://presentations.cocodataset.org/ECCV18/COCO18-Keypoints-Megvii.pdf) (feeding the upper or lower part of the body separately). Heatmap learning follows the typical approach for this task: an L2 loss is applied between the predicted keypoint locations and the ground-truth locations (generated by applying a 2D Gaussian with std = 1).

During the inference stage, a pre-trained object detector is used to provide bounding boxes. The final heatmap is obtained by averaging the heatmaps obtained from the original and flipped images. The pixel location of the keypoint is determined by the $argmax$ heatmap value with a quarter offset in the direction of the second-highest heatmap value.

While the pipeline described in this paper is common practice for pose estimation methods, this method achieves better results by proposing a network design that extracts better representations. This is done by running several parallel sub-networks of different resolutions (each one is half the size of the previous one) while repeatedly fusing the branches with each other:

https://raw.githubusercontent.com/leoxiaobin/deep-high-resolution-net.pytorch/master/figures/hrnet.png

The fusion process varies depending on the scale of the sub-network and its location in relation to the others:

https://i.imgur.com/mGDn7pT.png |
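The quarter-offset decoding step in the pose-estimation summary above can be illustrated as follows. This sketch approximates "the direction of the second-highest value" by comparing the two neighbours of the peak along each axis, which is one common way to implement it; whether the paper's code does exactly this is an assumption, and the function name is hypothetical.

```python
def decode_keypoint(heatmap):
    """Return a sub-pixel (x, y) location from a 2D heatmap (list of rows).

    The peak is found by argmax, then shifted by a quarter pixel along each
    axis toward the larger of the two neighbouring values.
    """
    height, width = len(heatmap), len(heatmap[0])
    y, x = max(((r, c) for r in range(height) for c in range(width)),
               key=lambda rc: heatmap[rc[0]][rc[1]])
    fx, fy = float(x), float(y)
    if 0 < x < width - 1:
        diff = heatmap[y][x + 1] - heatmap[y][x - 1]
        fx += 0.25 * ((diff > 0) - (diff < 0))  # sign of the horizontal slope
    if 0 < y < height - 1:
        diff = heatmap[y + 1][x] - heatmap[y - 1][x]
        fy += 0.25 * ((diff > 0) - (diff < 0))  # sign of the vertical slope
    return fx, fy
```

Peaks on the border of the heatmap are returned without an offset, since one of the neighbours is missing there.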
The method is a multi-task learning model performing person detection, keypoint detection, person segmentation, and pose estimation. It is a bottom-up approach, as it first localizes identity-free semantics and then groups them into instances.

https://i.imgur.com/kRs9687.png

Model structure:
- **Backbone**. The feature extractor is a ResNet (50 or 101) with one [Feature Pyramid Network](https://arxiv.org/pdf/1612.03144.pdf) (FPN) for the keypoint branch and one for the person detection branch. FPN enhances extracted features through multi-level representation.
- **Keypoint detection** detects keypoints and also produces a pixel-level segmentation mask.
  https://i.imgur.com/XFAi3ga.png
  FPN features $K_i$ are processed with multiple $3\times3$ convolutions followed by concatenation and a final $1\times1$ convolution to obtain predictions for each keypoint, as well as the segmentation mask (see the figure for details). This results in #keypoints_in_dataset_per_person + 1 output layers. Additionally, intermediate supervision (i.e. loss) is applied at the FPN outputs. An $L_2$ loss between the predictions and Gaussian peaks at the keypoint locations is used. Similarly, an $L_2$ loss is applied to the segmentation predictions and the corresponding ground-truth masks.
- **Person detection** is essentially a [RetinaNet](https://arxiv.org/pdf/1708.02002.pdf), a one-stage object detector, modified to only handle the *person* class.
- **Pose estimation**. Given initial keypoint predictions, the Pose Residual Network (PRN) selects a single keypoint for each class.
  https://i.imgur.com/k8wNP5p.png
  During inference, PRN takes cropped outputs from the keypoint detection branch, defined by the predicted bounding boxes from the person detection branch, resizes them to a fixed size, and forwards them through a multilayer perceptron with a residual connection. During training, the same process is performed, except that the cropped keypoints come from the ground-truth annotation defined by a labeled bounding box.
This model is not end-to-end trainable. While the keypoint and person detection branches can, in theory, be trained simultaneously, the PRN requires separate training.

**Personal note**. Interestingly, PRN training with ground truth inputs (i.e. "perfect" inputs) only reaches a validation score of 89.4 mAP, which is surprisingly far from the maximum possible score. This presumably means that even if the preceding networks or branches perform perfectly, the PRN might become a performance bottleneck. Therefore, more effort should be directed at the PRN itself. Moreover, modifying the network to support end-to-end training might help boost the performance.

Open-source implementations used to check that my understanding of the paper is correct: [link1](https://github.com/LiMeng95/MultiPoseNet.pytorch), [link2](https://github.com/IcewineChen/pytorch-MultiPoseNet). |
This paper tackles the challenge of action recognition by representing a video as space-time graphs: the **similarity graph** captures the relationship between correlated objects in the video, while the **spatial-temporal graph** captures the interaction between objects.

The algorithm is composed of several modules:

https://i.imgur.com/DGacPVo.png

1. **Inflated 3D (I3D) network**. In essence, it is a standard 2D CNN (e.g. ResNet-50) converted to a 3D CNN by copying the 2D weights along an additional dimension and then renormalizing them. The network takes a *batch x 3 x 32 x 224 x 224* tensor as input and outputs *batch x 16 x 14 x 14*.
2. **Region Proposal Network (RPN)**. This is the same RPN used to predict initial bounding boxes in two-stage detectors like Faster R-CNN. Specifically, it predicts a predefined number of bounding boxes on every other frame of the input (the input is initially 32 frames, thus 16 frames are used) to match the temporal dimension of the I3D network's output. Then, the I3D output features and the bounding boxes projected onto them are passed to ROIAlign to obtain temporal features for each object proposal. Fortunately, PyTorch comes with a [pretrained Faster R-CNN on MSCOCO](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) which can easily be cut down to only the RPN functionality.
3. **Similarity Graph**. This graph represents the feature similarity between different objects in a video. Given features $x_i$ extracted by RPN+ROIAlign for every bounding box prediction in a video, the similarity between any pair of objects is computed as $F(x_i, x_j) = (wx_i)^T * (w'x_j)$, where $w$ and $w'$ are learnable transformation weights. Softmax normalization is performed over the edges of the graph connected to the current node $i$. The graph convolutional network consists of several graph convolutional layers with ReLU activations in between.
Graph construction and convolutions can be conveniently implemented using [PyTorch Geometric](https://github.com/rusty1s/pytorch_geometric).
4. **Spatial-Temporal Graph**. This graph captures the spatial and temporal relationships between objects in neighboring frames. To construct the graph $G_{i,j}^{front}$, we iterate through every bounding box in frame $t$ and compute its Intersection over Union (IoU) with every object in frame $t+1$. The IoU value serves as the weight of the edge connecting nodes (ROI-aligned features from RPN) $i$ and $j$. The edge values are normalized so that the edge values connected to proposal $i$ sum to 1. In a similar manner, the backward graph $G_{i,j}^{back}$ is defined by analyzing frames $t$ and $t-1$.
5. **Classification Head**. The classification head takes two inputs. One comes from the average-pooled features of the I3D model, resulting in a *1 x 512* tensor. The other is the pooled sum of features (i.e. a *1 x 512* tensor) from the graph convolutional networks defined above. Both inputs are concatenated and fed to a Fully-Connected (FC) layer to perform the final multi-label (or multi-class) classification.

**Dataset**. The authors have tested the proposed algorithm on the [Something-Something](https://20bn.com/datasets/something-something) and [Charades](https://allenai.org/plato/charades/) datasets. For the first dataset, a softmax loss function is used, while the second one uses a binary sigmoid loss to handle its multi-label property. The input data is sampled at 6fps, covering about 5 seconds of a video input.

**My take**. I think this paper is a great engineering effort. While the paper is easy to understand at a high level, implementing it is much harder, partially due to unclear/misleading writing/description. I have challenged myself with [reproducing this paper](https://github.com/BAILOOL/Videos-as-Space-Time-Region-Graphs). It is a work in progress, so be careful not to damage your PC and eyes :-) |
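The construction of the forward spatial-temporal graph described in the space-time region graph summary above (IoU edge weights, normalized per proposal) can be sketched without any deep learning framework. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def forward_edges(boxes_t, boxes_next):
    """Edge weights from every box in frame t to every box in frame t+1.

    Each row is normalized to sum to 1; a box with no overlap at all keeps
    an all-zero row.
    """
    rows = []
    for a in boxes_t:
        weights = [iou(a, b) for b in boxes_next]
        total = sum(weights)
        rows.append([w / total if total > 0 else 0.0 for w in weights])
    return rows
```

The backward graph would be obtained by calling `forward_edges` with the boxes of frame $t$ and frame $t-1$ instead.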
This paper presents a per-frame image-to-image translation system that copies the motion of a person in a source video to a target person. For example, the source video might show a professional dancer performing complicated moves, while the target person is you. With this approach, it is possible to generate a video of you dancing like a professional. Check the authors' [video](https://www.youtube.com/watch?v=PCBTZh41Ris) for a visual explanation.

**Data preparation**

The authors manually recorded high-resolution video (at 120fps) of a person performing various random moves. The video is then decomposed into frames, and the person's pose keypoints (body joints, hands, face) are extracted for each frame. These keypoints are then connected to form a stick figure of the person. In practice, pose estimation is performed by the open-source project [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose).

**Training**

https://i.imgur.com/VZCXZMa.png

Once the data is prepared, the training is performed in two stages:

1. **Training pix2pixHD model with temporal smoothing**. The core model is the original [pix2pixHD](https://tcwang0509.github.io/pix2pixHD/)[1] model with temporal smoothing. Specifically, if we were to use vanilla pix2pixHD, the input to the model would be a stick-figure image, and the target would be the image of the person corresponding to that pose. The network's objective would be $\min_{G} (Loss1 + Loss2 + Loss3)$, where:
   - $Loss1 = \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{GAN}(G, D_k)$ is the adversarial loss;
   - $Loss2 = \lambda_{FM} \sum_{k=1,2,3} \mathcal{L}_{FM}(G,D_k)$ is the feature matching loss;
   - $Loss3 = \lambda_{VGG}\mathcal{L}_{VGG}(G(x),y)$ is the VGG perceptual loss.

However, this objective does not account for the fact that we want to generate a video composed of temporally coherent frames.
The authors propose to ensure *temporal smoothing* between adjacent frames by including the pose, the corresponding image, and the generated image from the previous time step (a zero image for the first frame), as shown in the figure below:

https://i.imgur.com/0NSeBVt.png

Since the generated output $G(x_t; G(x_{t-1}))$ at time step $t$ is now conditioned on the previously generated frame $G(x_{t-1})$ as well as the current stick image $x_t$, better temporal consistency is ensured. Consequently, the discriminator now tries to determine both the correctness of the generation and the temporal consistency of a fake sequence $[x_{t-1}; x_t; G(x_{t-1}), G(x_t)]$.

2. **Training FaceGAN model**.

https://i.imgur.com/mV1xuMi.png

In order to improve face generation, the authors propose to use a specialized FaceGAN. In practice, this is another, smaller pix2pixHD model (with a global generator only, instead of local+global) which is fed the cropped face area of a stick image and the cropped face area of the corresponding generated image (from step 1) and aims to generate a residual that is added to the previously generated full image.

**Testing**

During testing, we extract frames from the input video, obtain a pose stick image for each frame, normalize the stick pose image, and feed it to pix2pixHD (with temporal consistency) and then to FaceGAN to produce the final generated image with improved facial features. Normalization is needed to account for possible pose variation between the source and target input videos.

**Remarks**

While this method produces visually appealing results, it is not perfect. There are several reasons for this:
1. *Quality of a pose stick image*: if the pose detector "misses" a keypoint, the generator might have difficulty generating a properly rendered image;
2. *Motion blur*: motion blur causes the pose detector to miss keypoints;
3. *Severe scale change*: if the source person is very far away, the keypoint detector might fail to detect proper keypoints.
Among video rendering challenges, the authors mention self-occlusion, cloth texture generation, video jittering (training-test motion mismatch). References: [1] "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs"
This paper introduces CNN-based segmentation of an object that is defined by a user with four extreme points (which also define a bounding box). Interestingly, related work has shown that clicking extreme points is about 5 times faster than drawing a bounding box.

https://i.imgur.com/9GJvf17.png

The extreme points serve several purposes in this work. First, they are used as a bounding box to crop the object of interest. Second, they are used to create a heatmap with activations in the regions of the extreme points. The heatmap is created as a 2D Gaussian centered around each of the extreme points. This heatmap is matched to the size of the resized crop (i.e. 512x512) and is concatenated with the original RGB channels of the crop. The concatenated input with channel depth 4 is fed to the network, which is a ResNet-101 with the FC and last two maxpool layers removed. In order to maintain the same receptive field, atrous convolution is used. The pyramid scene parsing module from PSPNet is used to aggregate global context. The network is trained with a standard cross-entropy loss weighted by a normalization factor (i.e. the frequency of a class in the dataset).

How does it compare to the "Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++" paper in terms of accuracy? Specifically, if the polygon is wrong, it is easy to correct the wrong points on the polygon. However, it is unclear how to obtain the preferred segmentation when, no matter how many (more than four) extreme points are selected, the object of interest is not segmented properly. |
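The two uses of the extreme points in the summary above (bounding box for cropping, Gaussian heatmap as a fourth input channel) can be sketched as below. The `sigma` value and the choice to combine the four Gaussians with a per-pixel maximum are assumptions made for this illustration; the summary only states that a 2D Gaussian is centered on each point.

```python
import math

def bounding_box(extreme_points):
    """Tight box (x_min, y_min, x_max, y_max) around the clicked points."""
    xs = [x for x, _ in extreme_points]
    ys = [y for _, y in extreme_points]
    return min(xs), min(ys), max(xs), max(ys)

def extreme_point_heatmap(extreme_points, width, height, sigma=10.0):
    """Heatmap (list of rows) with a 2D Gaussian centered on each point.

    Overlapping Gaussians are merged with a per-pixel maximum, so every
    clicked point has activation exactly 1.0 (an assumed design choice).
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in extreme_points:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], g)
    return heatmap
```

In a full pipeline the heatmap would be computed in the coordinates of the resized crop and stacked with the RGB channels to form the 4-channel network input.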
In this paper, the authors develop a system for automatic as well as interactive annotation (i.e. segmentation) of a dataset. In the automatic mode, bounding boxes are generated by another network (e.g. Faster R-CNN), while in the interactive mode, the input bounding box around an object of interest comes from the human in the loop.

The system is composed of the following parts:

https://github.com/davidjesusacu/polyrnn-pp/raw/master/readme/model.png

1. **Residual encoder with skip connections**. This step acts as a feature extractor. A ResNet-50 with a few modifications (i.e. reduced stride, use of dilation, removal of the average pooling and FC layers) serves as the base CNN encoder. Instead of using only the last features of the network, the authors concatenate outputs from different layers - resized to the highest feature resolution - to capture multi-level representations. This is shown in the figure below:
   https://www.groundai.com/media/arxiv_projects/226090/x4.png.750x0_q75_crop.png
2. **Recurrent decoder** is a two-layer ConvLSTM which takes the image features and the previous (or first) vertex position and outputs a one-hot encoding over a 28x28 grid of possible vertex positions, plus one extra entry indicating that the polygon is closed (i.e. the end of the sequence). An attention weight per location is computed from the CNN features and the 1st and 2nd layers of the ConvLSTM. Training is formulated as reinforcement learning, since the recurrent decoder is treated as a sequential decision-making agent. The reward function is the IoU between the mask generated by the enclosed polygon and the ground-truth mask.
3. **Evaluator network** chooses the best polygon among multiple candidates. The CNN features, the last state tensor of the ConvLSTM, and the predicted polygon are used as input, and the output is the predicted IoU. The best polygon is selected from the polygons generated using the 5 top-scoring first vertex predictions.
   https://i.imgur.com/84amd98.png
4. **Upscaling with Graph Neural Network** takes the list of vertices generated by the ConvLSTM decoder, adds a node between every two consecutive nodes (to produce finer details at higher resolution), and aims to predict the relative offset of each node at a higher resolution. Specifically, it extracts features around every predicted vertex and forwards them through a GGNN (Gated Graph Neural Network) to obtain the final location (i.e. offset) of the vertex (treated as a classification task).
   https://www.groundai.com/media/arxiv_projects/226090/x5.png.750x0_q75_crop.png

The whole system is not trained end-to-end. While the network was trained on the CityScapes dataset, it has shown reasonable generalization to different modalities (e.g. medical data). It would be very interesting to observe generalization in the opposite direction, i.e. train on medical data and see how the model performs on CityScapes data. |
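The node-insertion step that precedes the graph neural network in the annotation summary above can be sketched as follows. Placing each new node at the midpoint of its two neighbours is an assumption made for this illustration (the summary only says a node is added between two consecutive nodes), and the function name is hypothetical.

```python
def add_midpoints(vertices):
    """Insert one extra node between every pair of consecutive polygon vertices.

    The polygon is treated as closed, so a node is also added between the
    last vertex and the first one. The vertex count doubles.
    """
    refined = []
    n = len(vertices)
    for i, (x, y) in enumerate(vertices):
        nx, ny = vertices[(i + 1) % n]  # wrap around to close the polygon
        refined.append((x, y))
        refined.append(((x + nx) / 2.0, (y + ny) / 2.0))
    return refined
```

The graph network would then predict an offset for each of these nodes, original and inserted alike.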
This paper is about an interactive Visual Question Answering (VQA) setting in which agents must ask questions about images in order to learn. This closely mimics how people learn from each other using natural language and has strong potential to learn much faster with less data. It is referred to as learning by asking (LBA) throughout the paper. The approach is composed of three models:

http://imisra.github.io/projects/lba/approach_HQ.jpeg

1. **Question proposal module** is responsible for generating _important_ questions about the image. It is a combination of 2 models:
   - **Question generator** model produces a question. It is an LSTM that takes image features and a question type (randomly chosen from the available options) as input and outputs a question.
   - **Question relevance** model selects questions relevant to the image. It is a stacked attention architecture network (shown below) that takes in the generated question and image features and filters out questions that are irrelevant to the image.
     https://i.imgur.com/awPcvYz.png
2. **VQA module** learns to predict the answer given the image features and the question. It is implemented as the stacked attention architecture shown above.
3. **Question selection module** selects the most informative question to ask. It takes the current state of the VQA module and its output to calculate the expected accuracy improvement (details are in the paper), which measures how quickly the VQA module could improve for each answer. The single-question selection strategy (i.e. picking the best question for the VQA module to improve the fastest) is based on an epsilon-greedy policy.

This method (i.e. LBA) is shown to be about 50% more data efficient than the naive VQA method. As an interesting future direction of this work, the authors propose to use real-world images and include a human in the training as an answer provider. |
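The epsilon-greedy question selection mentioned in the learning-by-asking summary above follows a generic pattern that can be sketched as below. The `expected_gain` callback stands in for the expected accuracy improvement computed by the question selection module, and the default `epsilon` is a hypothetical value; neither is taken from the paper.

```python
import random

def select_question(questions, expected_gain, epsilon=0.1, rng=random):
    """Epsilon-greedy choice among candidate questions.

    With probability `epsilon` a random question is asked (exploration);
    otherwise the question with the highest expected gain is asked
    (exploitation). `rng` can be any object with `random()` and `choice()`.
    """
    if rng.random() < epsilon:
        return rng.choice(questions)
    return max(questions, key=expected_gain)
```

Passing a seeded `random.Random(0)` as `rng` makes the selection reproducible, which is convenient when comparing selection strategies.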
This paper performs pixel-wise segmentation of the object of interest which is specified by a sentence. The model is composed of three main components: a **textual encoder**, a **video encoder**, and a **decoder**. https://i.imgur.com/gjbHNqs.png - **Textual encoder** is a pre-trained word2vec model followed by a 1D CNN. - **Video encoder** is a 3D CNN that obtains a visual representation of the video (it can be combined with optical flow to obtain motion information). - **Decoder**. Given a sentence representation $T$, a separate filter $f^r = \tanh(W^r_f T + b^r_f)$ is created to match each feature map in the video frame decoder and is combined with the visual features as $S^r_t = f^r * V^r_t$, for each resolution $r$ at timestep $t$. The decoder is composed of a sequence of transposed convolution layers to get a response map of the same size as the input video frame. |
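The two decoder formulas above can be sketched with NumPy. This is an illustration under assumptions: the filter $f^r$ is treated as a 1x1 convolution kernel (one weight per channel), so applying it to $V^r_t$ reduces to a per-pixel dot product over channels. All shapes and the names `dynamic_filter` and `response_map` are hypothetical.

```python
import numpy as np

def dynamic_filter(T, W, b):
    """f = tanh(W T + b): map a sentence vector to one weight per channel."""
    return np.tanh(W @ T + b)

def response_map(f, V):
    """S = f * V, assuming f acts as a 1x1 convolution over channels.

    f: (C,) filter, V: (C, H, W) visual features -> (H, W) response.
    """
    return np.einsum("c,chw->hw", f, V)

rng = np.random.default_rng(0)
D, C, H, Wd = 8, 4, 5, 6                  # illustrative sizes only
T = rng.normal(size=D)                    # sentence representation
W_f = rng.normal(size=(C, D))             # filter-generating weights
b_f = rng.normal(size=C)
V = rng.normal(size=(C, H, Wd))           # visual features at one resolution

f = dynamic_filter(T, W_f, b_f)
S = response_map(f, V)
```

One such filter would be generated per decoder resolution, so the sentence conditions the response map at every scale.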
This paper introduces a new AI task - Embodied Question Answering. The goal of this task is for an agent to answer a question by observing the environment through a single egocentric RGB camera while being able to navigate inside the environment. The agent has 4 natural modules: https://i.imgur.com/6Mjidsk.png 1. **Vision**. 224x224 RGB images are processed by a CNN to produce a fixed-size representation. This CNN is pretrained on pixel-to-pixel tasks such as RGB reconstruction, semantic segmentation, and depth estimation. 2. **Language**. Questions are encoded with 2-layer LSTMs with 128-d hidden states. Separate question encoders are used for the navigation and answering modules to capture the important words for each module. 3. **Navigation** is composed of a planner (forward, left, right, and stop actions) and a controller that executes the planner-selected action a variable number of times. The planner is an LSTM that takes its hidden state, the image representation, the question, and the previous action. In contrast, the controller is an MLP with 1 hidden layer which takes the planner's hidden state, the action from the planner, and the image representation, and decides whether to execute the action again or pass the lead back to the planner. 4. **Answering** module computes an image-question similarity for the last 5 frames via a dot product between image features (passed through an fc-layer to align with question features) and the question encoding. This similarity is converted to attention weights via a softmax, and the attention-weighted image features are combined with the question features and passed through an answer classifier. Visually this process is shown in the figure below. https://i.imgur.com/LeZlSZx.png [Successful results](https://www.youtube.com/watch?v=gVj-TeIJfrk) as well as [failure cases](https://www.youtube.com/watch?v=4zH8cz2VlEg) are provided. Generally, this is very promising work which literally just scratches the surface of what is possible. 
There are several constraints which can be mitigated to push this field to more general outcomes. For example, use more general environments with more realistic graphics and broader set of questions and answers. |
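The attention step of the answering module described above can be sketched as follows. This is a rough illustration, not the authors' implementation: the feature sizes are made up, the fc-layer is a single matrix `W_fc`, and "combined" is assumed to mean concatenation. The function name `attend_frames` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())               # subtract max for numerical stability
    return e / e.sum()

def attend_frames(frame_feats, q_enc, W_fc):
    """Attention over the last frames, as described in the summary.

    frame_feats: (n_frames, d_img), q_enc: (d_q,), W_fc: (d_img, d_q).
    Returns attention weights and the classifier input.
    """
    projected = frame_feats @ W_fc        # align image features with the question
    scores = projected @ q_enc            # dot-product similarity per frame
    weights = softmax(scores)             # attention weights over frames
    attended = weights @ projected        # attention-weighted image features
    # Assumption: features are combined by concatenation.
    return weights, np.concatenate([attended, q_enc])

frames = np.zeros((5, 2))                 # 5 identical frames -> uniform attention
question = np.array([1.0, -1.0])
weights, clf_input = attend_frames(frames, question, np.eye(2))
```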
The goal of this work is to perform transfer learning among numerous tasks and to discover visual relationships among them. Specifically, while we might intuitively guess that the depth of an image and its surface normals are related, this work takes a step forward and discovers beneficial relationships among 26 tasks in terms of task transferability - many of them are not obvious. This is important for scenarios where the annotation budget for the target task is insufficient; the representation learned from a 'cheaper' task can then be used along with a small dataset for the target task to reach performance on par with fully supervised training on a large dataset. The basis of the approach is to compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily used for another task. This approach does not impose human intuition about the task relationships and chooses task transferability based on the quality of a transfer operation in a fully computational manner. The task taxonomy (i.e. **taskonomy**) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. It is computed using a four-step process depicted in the figure below: ![Process overview. The steps involved in creating the taxonomy.](http://taskonomy.stanford.edu/img/svg/Process.svg) - In stage I (**Task-specific Modelling**), a task-specific network is trained in a fully supervised manner. The network is composed of an encoder (modified ResNet-50) and either a fully convolutional decoder for pixel-to-pixel tasks or 2-3 FC layers for low-dimensional tasks. The dataset consists of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task. - In stage II (**Transfer modeling**), all feasible transfers between sources and targets are trained (transfers from multiple source tasks to a single target are also considered). 
Specifically, after the task-specific networks are trained in stage I, the weights of an encoder are fixed (the frozen network is used to extract representations only) and the representation from the encoder is used to train a small readout network (similar to a decoder from stage I) with a new task as a target (i.e. ground truth is available). In total, about 3000 transfer possibilities are trained. - In stage III (**Taxonomy solver**), the task affinities acquired from the performance of the transfer functions are normalized. This is needed because different tasks lie in different spaces and their transfer performances are measured on different scales. This is performed using ordinal normalization - Analytical Hierarchy Process (details are in the paper - Section 3.3). This results in a fully normalized affinity matrix, i.e. a complete graph that quantifies every pair of tasks in terms of transfer performance (i.e. task dependency). - In stage IV (**Computed Taxonomy**), a hypergraph which can predict the performance of any transfer policy and optimize for the optimal one is synthesized. This is solved using a Binary Integer Program as a subgraph selection problem where tasks are nodes and transfers are edges. After the optimization process, the solution devises a connectivity that solves all target tasks and maximizes their collective performance while using only the available source tasks under user-specified constraints (e.g. budget). So, if you want to train your network on an unseen task, you can obtain pretrained weights for existing tasks from the [project page](https://github.com/StanfordVL/taskonomy/tree/master/taskbank), train readout functions against each task (as well as combinations of multiple inputs), build an affinity matrix to know where your task is positioned against the other ones, and through the subgraph selection procedure observe which tasks have a favourable influence on your task. 
Consequently, you can train your task with much less data by utilizing representations from the existing tasks which share visual significance with your task. Magnificent! |
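The ordinal normalization of stage III can be illustrated with a small sketch of the Analytical Hierarchy Process. It rests on assumptions about Section 3.3 that should be checked against the paper: for one target task, entry $(i, j)$ of a matrix holds the fraction of test images on which source $i$ transfers better than source $j$, the matrix is clipped away from 0 and 1, divided element-wise by its transpose, and the principal eigenvector gives the normalized affinities. The function name `ahp_affinities` and the clipping value are hypothetical.

```python
import numpy as np

def ahp_affinities(win_rate, eps=1e-3):
    """Turn pairwise win rates into normalized affinities (AHP sketch).

    win_rate[i, j]: assumed fraction of test images where source i
    beats source j on a fixed target task.
    """
    w = np.clip(np.asarray(win_rate, dtype=float), eps, 1.0 - eps)
    ratio = w / w.T                        # how many times better i is than j
    eigvals, eigvecs = np.linalg.eig(ratio)
    principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return principal / principal.sum()     # also fixes the eigenvector sign

# Toy example: three sources whose hidden strengths are 3, 2 and 1.
strength = np.array([3.0, 2.0, 1.0])
win = strength[:, None] / (strength[:, None] + strength[None, :])
affinity = ahp_affinities(win)
```

Because only win rates are used, the result does not depend on the loss scale of the target task, which is the point of the ordinal approach.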
This paper estimates 3D hand shape from **single** RGB images based on deep learning. The overall pipeline is the following: https://i.imgur.com/H72P5ns.png 1. **Hand Segmentation** network is derived from this [paper](https://arxiv.org/pdf/1602.00134.pdf) but, in essence, any segmentation network would do the job. The hand image is cropped from the original image using the segmentation mask and resized to a fixed size (256x256) with bilinear interpolation. 2. **Detecting hand keypoints**. 2D keypoint detection is formulated as predicting a score map for each hand joint (fixed number = 21). An encoder-decoder architecture is used. 3. **3D hand pose estimation**. https://i.imgur.com/uBheX3o.png - In this paper, the hand pose is represented as $w_i = (x_i, y_i, z_i)$, where $i$ is the index of a particular hand joint. This representation is further normalized as $w_i^{norm} = \frac{1}{s} \cdot w_i$, where $s = ||w_{k+1} - w_{k} ||$, and the position relative to a reference joint $r$ (palm) is obtained as $w_i^{rel} = w_i^{norm} - w_r^{norm}$. - The network predicts coordinates within a canonical frame and additionally estimates the transformation into the canonical frame (as opposed to predicting absolute 3D coordinates). Therefore, the network predicts $w^{c^*} = R(w^{rel}) \cdot w^{rel}$ and $R(w^{rel}) = R_y \cdot R_{xz}$. Information about whether the input is a left or right hand is concatenated to the flattened feature representation. The training loss is composed of separate L2 terms for the canonical coordinates and the canonical transformation matrix. Contribution: - Apparently, the first method to perform 3D hand shape estimation from a single RGB image rather than using both RGB and depth sensors; - Possible extension to the sign language recognition problem by attaching a classifier on the predicted 3D poses. While this approach predicts 3D hand poses quite accurately in individual frames, the predictions often fluctuate from frame to frame. Probably several techniques (e.g. 
optical flow, RNN, post-processing smoothing) can be used to enforce temporal consistency and make predictions more stable across frames. |
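The pose normalization from step 3 above, $w_i^{norm} = \frac{1}{s} \cdot w_i$ followed by $w_i^{rel} = w_i^{norm} - w_r^{norm}$, can be written as a short sketch. The joint indices used for the reference bone `k` and the root joint are hypothetical here, and the toy pose has 3 joints instead of 21 for readability.

```python
import numpy as np

def normalize_pose(w, k, root=0):
    """Scale-normalize joints and express them relative to a root joint.

    w: (n_joints, 3) joint coordinates.
    k: index of the joint whose bone to joint k+1 defines the scale s.
    root: index of the reference joint r (assumed to be the palm).
    """
    w = np.asarray(w, dtype=float)
    s = np.linalg.norm(w[k + 1] - w[k])   # s = ||w_{k+1} - w_k||
    w_norm = w / s                        # w_i^norm
    return w_norm - w_norm[root]          # w_i^rel

pose = [[1, 1, 1], [1, 1, 3], [1, 1, 5]]
rel = normalize_pose(pose, k=1, root=0)
```

After this step the reference bone has unit length and the root joint sits at the origin, so the network does not need to predict absolute scale or position.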
Given microscopy cell data, this work tries to determine the number of cells in the image. The whole pipeline is composed of two steps: 1. Cell segmentation: - A Feature Pyramid Network (FPN) is used for generating a foreground mask; - The last output of the FPN is used for predicting mean foreground masks and aleatoric uncertainty masks. Each mask in both outputs is trained with the aleatoric loss $ \frac{||y_{pred} - y_{gt} ||^2}{2\sigma} + \log{2\sigma}$ and a [total-variational](https://en.wikipedia.org/wiki/Total_variation_denoising) loss. https://i.imgur.com/ssTuGVe.png 2. Cell counting: - A VGG-11 network is used as a feature extractor on the predicted foreground segmentation masks. There are two output branches following VGG: a cell count branch and an estimated variance branch. Training is done using an L2 loss function with aleatoric uncertainty for cell counts. https://i.imgur.com/aijZn7e.png While the idea of utilizing neural networks to count cells in the image seems fascinating, the real benefit of such a system in production is quite questionable. Specifically, why would you need to add a VGG-like feature extractor on top of already predicted cell segmentation masks, if you could simply do more work in the segmentation network (i.e. separate cells better, predict objectness/contour) and get the number of cells directly from the predicted masks? |
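The aleatoric loss quoted above can be sketched directly from the formula as written in this summary. This sketch assumes the loss is evaluated per pixel and averaged, and that the network outputs a strictly positive $\sigma$ for every pixel; the function name `aleatoric_loss` is hypothetical.

```python
import numpy as np

def aleatoric_loss(y_pred, y_gt, sigma):
    """Mean of (y_pred - y_gt)^2 / (2 sigma) + log(2 sigma).

    Follows the formula as stated in the summary; sigma must be > 0.
    Large sigma discounts the squared error but is penalized by the log term.
    """
    y_pred, y_gt, sigma = (np.asarray(a, dtype=float) for a in (y_pred, y_gt, sigma))
    sq_err = (y_pred - y_gt) ** 2
    return float(np.mean(sq_err / (2.0 * sigma) + np.log(2.0 * sigma)))

perfect = aleatoric_loss([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])
off_by_one = aleatoric_loss([1.0, 1.0], [0.0, 0.0], [0.5, 0.5])
```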
This paper synthesizes a high-quality video of Barack Obama given the audio. Practically, it only synthesizes the region around the mouth, while the rest of the elements (i.e. pixels) come from a video in a database. The overall pipeline is the following: - Given a video, the audio and the mouth shape are extracted. Audio is represented as MFCC coefficients; mouth shape - 18 lip markers; - Train an audio to mouth shape mapping with a time-delayed unidirectional LSTM. - Synthesize mouth texture: retrieve a number of video frames from a database where the mouth shape is similar to the output of the LSTM; synthesize a median texture by applying a weighted median on mouth shapes from the retrieved video frames; manually select a teeth target frame (the selection criteria are purely subjective) and enhance the teeth in the median texture with the selected teeth target frame. - Re-timing to avoid situations where Obama is not speaking but his head is moving, which looks very unnatural. - Final composition into the target video involves jaw correction to make it more natural. ![Algorithm flow](http://www.kurzweilai.net/images/Obama-lip-Sync-Graphic.jpg) The results look ridiculously natural. The authors suggest that one of the applications of this paper is speech summarization, where you summarize a speech not only with selected parts as text and audio but also synthesize a video for it. Personally, this work inspires me to work on a method that is able to generate a natural sign language interpreter that takes sound/text as input and produces sign language moves. |
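The weighted median used in the texture synthesis step above can be illustrated on scalar values. How the paper derives the weights and applies the median to image data is not covered by this summary, so the sketch below only shows the basic operation; applying it independently per pixel and channel across the retrieved frames is an assumption, and `weighted_median` is a hypothetical name.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)             # sort candidates by value
    cum = np.cumsum(weights[order])        # running total of their weights
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[order][idx]

# Intensities of one pixel in three retrieved frames (toy numbers).
equal = weighted_median([10, 20, 30], [1, 1, 1])
skewed = weighted_median([10, 20, 30], [5, 1, 1])
```

Unlike a weighted mean, the result is always one of the candidate values, which helps avoid blurring when the retrieved frames disagree.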
This paper tackles the challenging task of hand shape and continuous Sign Language Recognition (SLR) directly from images obtained from a common RGB camera (rather than utilizing motion sensors like Kinect). The basic idea is to create a network that is end-to-end trainable with input (i.e. images) and output (i.e. hand shape labels, word labels) sequences. The network is composed of three parts: - CNN as a feature extractor - Bidirectional LSTMs for temporal modeling - Connectionist Temporal Classification as a loss layer ![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png) Results: - Observed state-of-the-art results (at the time of publishing) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets. - Utilizing full images rather than hand patches provides better performance for continuous SLR. - A network that recognizes hand shape and a network that recognizes word sequences can be combined and trained together to recognize word sequences. Fine-tuning all layers of the combined system works better than fixing the "feature extraction" layers. - A combination of two networks where each network is trained on a separate task performs slightly better than training each network on word sequences. - Only a marginal difference in performance is observed between different decoding and post-processing techniques during sequence-to-sequence prediction. |
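Since the network is trained with Connectionist Temporal Classification, its per-frame outputs must be collapsed into a label sequence at test time. The sketch below shows the standard greedy (best-path) CTC collapse rule as one simple decoding option; it is not necessarily the decoder used in the paper, and the blank index of 0 is an assumption.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse per-frame argmax labels into an output sequence.

    Standard CTC rule: merge consecutive repeats, then drop blanks.
    A blank between two identical labels keeps both of them.
    """
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-frame predictions for 8 video frames (toy example).
words = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
```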