The goal of this paper is to learn a model that embeds 2D keypoints (the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding in which poses that are nearby in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D projections, and it helps to learn which apparently distinct 2D representations actually map to the same 3D pose. To do this, the basic approach used by the authors (which has a few variants) is:

- Take a dataset of 3D poses and corresponding 2D projections
- Define a notion of "matching" 3D poses, based on a parameter kappa, which designates the maximum average per-joint distance at which two 3D poses can be considered the same
- Construct triplets composed of an anchor pose, a "positive" pose (a different 2D pose with a matching 3D pose), and a "negative" pose (some other 2D pose sampled from the dataset using a strategy that explicitly seeks out hard negative examples)
- Calculate a triplet loss that pulls positive examples closer to the anchor and pushes negative examples farther away. This is specifically done by defining a probabilistic representation of p(match | z1, z2), that is, the probability of a match in 3D space given the embeddings of the two 2D poses. This is parametrized using a sigmoid with trainable parameters, as shown below

https://i.imgur.com/yFCCVuA.png

- They then calculate a distance kernel as the negative log of that probability, and calculate the basic triplet loss, which tries to maximize the gap between the anchor-negative distance and the anchor-positive distance.
- They also add an additional loss further incentivizing the match probability to be higher on the positive pair (in addition to just pushing the positive and negative pairs further apart)
- The final loss is a Gaussian prior loss, incentivizing the learned embeddings z to be in the shape of a Gaussian

https://i.imgur.com/SxvcvJG.png

This represents the central shape of the method. Some additional ablations include:

- Camera Augmentation: Creating additional triplets by taking existing 3D poses and generating artificial pairs of 2D poses at different camera views
- Temporal Pose Embedding: Embedding multiple temporally connected poses, rather than just a single one
- Keypoint Dropout: To simulate situations where some keypoints are occluded, the authors tried training with some keypoints dropped out, either selected at random, or selected jointly and non-independently based on a model of which keypoints are likely to be occluded together

The authors found that their method was generally quite a bit stronger than prior approaches for the task of querying similar 3D poses given a 2D pose input, including some alternate methods that do direct 3D estimation.
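To make the matching rule and the first two loss terms above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default values of `kappa`, `a`, `b`, and `margin`, and the exact sigmoid form `sigmoid(-a * ||z1 - z2|| + b)` are all assumptions made for this example, and the matching check skips any pose alignment or normalization the paper may apply. The Gaussian prior term is left out because the summary does not spell out its form.

```python
import numpy as np


def poses_match(pose_a, pose_b, kappa=0.1):
    """Assumed matching rule: mean per-joint 3D distance is at most kappa.

    pose_a, pose_b: arrays of shape (num_joints, 3).
    """
    per_joint = np.linalg.norm(pose_a - pose_b, axis=-1)
    return bool(per_joint.mean() <= kappa)


def match_probability(z1, z2, a=1.0, b=0.0):
    """Assumed form of p(match | z1, z2): sigmoid(-a * ||z1 - z2|| + b).

    a and b stand in for the trainable sigmoid parameters.
    """
    dist = np.linalg.norm(z1 - z2, axis=-1)
    # sigmoid(-a * dist + b) written as 1 / (1 + exp(a * dist - b))
    return 1.0 / (1.0 + np.exp(a * dist - b))


def distance_kernel(z1, z2, a=1.0, b=0.0):
    """Distance kernel: negative log of the match probability."""
    return -np.log(match_probability(z1, z2, a, b))


def triplet_loss(anchor, positive, negative, margin=1.0, a=1.0, b=0.0):
    """Hinge-style triplet loss on the distance kernel (margin is hypothetical)."""
    d_pos = distance_kernel(anchor, positive, a, b)
    d_neg = distance_kernel(anchor, negative, a, b)
    return np.maximum(0.0, d_pos - d_neg + margin)


def positive_pair_loss(anchor, positive, a=1.0, b=0.0):
    """Extra term that raises the match probability of the positive pair."""
    return distance_kernel(anchor, positive, a, b)


# Toy embeddings: the positive sits close to the anchor, the negative far away.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negative = np.array([3.0, 4.0])

total = triplet_loss(anchor, positive, negative) + positive_pair_loss(anchor, positive)
```

In this toy case the negative is already far enough from the anchor that the triplet term is zero, so only the positive-pair term contributes to `total`. Swapping `positive` and `negative` gives a strictly positive triplet term, which is the situation the loss is meant to penalize.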