Open-Vocabulary DETR with Conditional Matching
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy
arXiv e-Print archive - 2022 via Local arXiv
Keywords: cs.CV, cs.AI
First published: 2024/11/21
Abstract: Open-vocabulary object detection, which is concerned with the problem of
detecting novel objects guided by natural language, has gained increasing
attention from the community. Ideally, we would like to extend an
open-vocabulary detector such that it can produce bounding box predictions
based on user inputs in the form of either natural language or an exemplar image. This
offers great flexibility and user experience for human-computer interaction. To
this end, we propose a novel open-vocabulary detector based on DETR -- hence
the name OV-DETR -- which, once trained, can detect any object given its class
name or an exemplar image. The biggest challenge of turning DETR into an
open-vocabulary detector is that it is impossible to calculate the
classification cost matrix of novel classes without access to their labeled
images. To overcome this challenge, we formulate the learning objective as a
binary matching one between input queries (class name or exemplar image) and
the corresponding objects, which learns useful correspondence to generalize to
unseen queries during testing. For training, we choose to condition the
Transformer decoder on the input embeddings obtained from a pre-trained
vision-language model like CLIP, in order to enable matching for both text and
image queries. With extensive experiments on LVIS and COCO datasets, we
demonstrate that our OV-DETR -- the first end-to-end Transformer-based
open-vocabulary detector -- achieves non-trivial improvements over the current
state of the art.
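
The binary matching objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that, for one conditional query (a class name or an exemplar image), the decoder emits a single "matches the query" probability and one box per prediction, and that the matching cost is the negative match probability plus a weighted L1 box distance. The function names, the `box_weight` value, and the brute-force search (used here in place of the Hungarian algorithm so the example needs no third-party libraries) are all assumptions made for illustration.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def conditional_match(match_probs, pred_boxes, gt_boxes, box_weight=5.0):
    """Assign each ground-truth box of the queried class to one prediction.

    The cost uses only a class-agnostic match probability and box geometry,
    so no per-class classification cost is needed (hypothetical cost form).
    Returns a list of (prediction_index, ground_truth_index) pairs.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth boxes")

    # cost[i][j]: cost of assigning prediction i to ground-truth box j
    cost = [
        [-match_probs[i] + box_weight * l1_box_cost(pred_boxes[i], gt_boxes[j])
         for j in range(n_gt)]
        for i in range(n_pred)
    ]

    # Brute-force search over one-to-one assignments (fine for tiny examples;
    # a real detector would use the Hungarian algorithm instead).
    best_total, best_perm = float("inf"), ()
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return [(i, j) for j, i in enumerate(best_perm)]


def binary_targets(n_pred, matches):
    """Binary labels: 1 for predictions matched to the query, 0 otherwise."""
    matched = {i for i, _ in matches}
    return [1 if i in matched else 0 for i in range(n_pred)]


# Three predictions conditioned on one query, two ground-truth boxes.
match_probs = [0.9, 0.2, 0.8]
pred_boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.1, 0.1)]
gt_boxes = [(0.8, 0.8, 0.1, 0.1), (0.5, 0.5, 0.2, 0.2)]

matches = conditional_match(match_probs, pred_boxes, gt_boxes)
targets = binary_targets(len(pred_boxes), matches)
print(matches)   # [(2, 0), (0, 1)]
print(targets)   # [1, 0, 1]
```

Because the cost never refers to a class index, the same matching procedure applies unchanged when the query is a class that was unseen during training, which is the property the abstract relies on.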