First published: 2018/07/05
Abstract: This paper presents KeypointNet, an end-to-end geometric reasoning framework
to learn an optimal set of category-specific 3D keypoints, along with their
detectors. Given a single image, KeypointNet extracts 3D keypoints that are
optimized for a downstream task. We demonstrate this framework on 3D pose
estimation by proposing a differentiable objective that seeks the optimal set
of keypoints for recovering the relative pose between two views of an object.
Our model discovers geometrically and semantically consistent keypoints across
viewing angles and instances of an object category. Importantly, we find that
our end-to-end framework using no ground-truth keypoint annotations outperforms
a fully supervised baseline using the same neural network architecture on the
task of pose estimation. The discovered 3D keypoints on the car, chair, and
plane categories of ShapeNet are visualized at http://keypointnet.github.io/.
What the paper is about:
KeypointNet learns a set of 3D keypoints, together with their 2D detectors, that is optimized for a specified downstream task. The authors demonstrate this on relative pose estimation across two views of an object. They show that relative pose estimates computed from KeypointNet's keypoints are more accurate than those from a fully supervised keypoint baseline that uses the same network architecture, even though KeypointNet is trained without ground-truth keypoint annotations.
Approach:
Each training sample for KeypointNet comprises two views (images) of the same object, along with the ground-truth (GT) relative transform between them. The task is to produce, for each view, an ordered list of 3D keypoints such that orthogonal Procrustes alignment of the two lists recovers the true relative 3D pose between the views. The network has N heads, each of which extracts one 3D keypoint from a 2D image. There are two primary loss terms. A multi-view consistency loss measures the discrepancy between the two sets of extracted keypoints once they are related by the GT transform. A relative-pose estimation loss penalizes the angular discrepancy between the transform estimated from the extracted keypoints (via orthogonal Procrustes) and the GT transform. Additionally, keypoints are required to be distant from each other and to lie within the object silhouette.
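To make these loss terms concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes keypoints are given as N x 3 arrays in each view's camera frame and that the GT transform is a pure rotation; the paper's own formulation may differ (for example, by measuring consistency on projected image positions). The function names, the arccos-of-trace angle formula, and the `margin` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np


def procrustes_rotation(P, Q):
    """Rotation R minimizing sum_i ||R p_i - q_i||^2 (Kabsch / orthogonal Procrustes).

    P, Q: (N, 3) arrays of corresponding keypoints, one point per row.
    """
    Pc = P - P.mean(axis=0)            # center both point sets
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T


def angular_error(R_est, R_gt):
    """Geodesic angle (radians) between two rotation matrices."""
    cos_theta = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def consistency_loss(P, Q, R_gt):
    """Mean squared distance between view-A keypoints mapped by R_gt and view-B keypoints.

    Simplified 3D version (assumption): rotation only, no projection to the image plane.
    """
    return float(np.mean(np.sum((P @ R_gt.T - Q) ** 2, axis=1)))


def separation_loss(P, margin):
    """Hinge penalty on pairs of keypoints closer than `margin` (hypothetical parameter)."""
    diff = P[:, None, :] - P[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)                 # pairwise squared distances
    hinge = np.maximum(0.0, margin ** 2 - d2)
    n = P.shape[0]
    off_diag = ~np.eye(n, dtype=bool)               # ignore each point's distance to itself
    return float(hinge[off_diag].sum() / (n * (n - 1)))
```

In this sketch, a pose loss for one training pair would be `angular_error(procrustes_rotation(P, Q), R_gt)`, where `P` and `Q` are the keypoints predicted for the two views. Because keypoint order is fixed by the network heads, no correspondence search is needed before the alignment. The silhouette constraint is omitted here since it depends on the object mask.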