Detects any class given class names

Run time and cost

This model runs on NVIDIA T4 GPU hardware. Predictions typically complete within 38 seconds, although prediction time varies significantly with the inputs.


Detecting Twenty-thousand Classes using Image-level Supervision

Detic: a Detector with image classes, which can use image-level labels to train detectors easily.
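
The card does not spell out how image-level labels reach a box detector. One simple scheme in this spirit, assumed here for illustration and not taken from the Detic code, is to apply a multi-label classification loss to the largest proposal in each image, treating that box as a stand-in for the whole image. The sketch uses plain Python; the `(x1, y1, x2, y2)` box format, the function names, and the toy numbers are all hypothetical.

```python
import math
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes get area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def image_label_loss(
    boxes: Sequence[Box],
    logits: Sequence[Sequence[float]],
    image_labels: Sequence[int],
) -> float:
    """Mean per-class binary cross-entropy on the largest proposal.

    boxes:        one box per proposal
    logits:       raw classifier scores, one row per proposal, one column per class
    image_labels: indices of classes present in the image (no box annotations)
    """
    if not boxes:
        raise ValueError("need at least one proposal")
    # Assumption: the largest proposal approximates "the whole image".
    idx = max(range(len(boxes)), key=lambda i: box_area(boxes[i]))
    scores = logits[idx]
    positives = set(image_labels)
    loss = 0.0
    for c, z in enumerate(scores):
        target = 1.0 if c in positives else 0.0
        # Numerically stable BCE with logits: max(z, 0) - z*t + log(1 + exp(-|z|)).
        loss += max(z, 0.0) - z * target + math.log1p(math.exp(-abs(z)))
    return loss / len(scores)


# Toy usage: three proposals, two classes; the image is tagged with class 0 only.
boxes = [(0, 0, 10, 10), (0, 0, 100, 80), (5, 5, 20, 20)]
logits = [[0.0, 0.0], [3.0, -3.0], [-1.0, 1.0]]
print(image_label_loss(boxes, logits, [0]))
```

Only the largest box (the second one here) contributes, so the small proposals are never pushed toward the image tag; a real detector would combine a term like this with the usual box-supervised losses on detection data.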

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)


- Detects any class given its class name (using CLIP).
- We train the detector on the ImageNet-21K dataset, which has 21K classes.
- Cross-dataset generalization to OpenImages and Objects365 without finetuning.
- State-of-the-art results on open-vocabulary LVIS and open-vocabulary COCO.
- Works for DETR-style detectors.
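
Detecting "any class given class names" is possible because, in open-vocabulary detectors of this kind, each class name is encoded into an embedding (here, with CLIP) and each detected region is scored against those embeddings instead of against a fixed, learned list of classes. A minimal sketch of that scoring step, assuming region and text embeddings are already computed; the tiny 3-D vectors stand in for real CLIP outputs, and `score_classes` and the vocabulary are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def score_classes(
    region_emb: Sequence[float],
    class_text_embs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank vocabulary entries by cosine similarity to one region embedding."""
    r = l2_normalize(region_emb)
    scores = []
    for name, text_emb in class_text_embs.items():
        t = l2_normalize(text_emb)
        scores.append((name, sum(a * b for a, b in zip(r, t))))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vocabulary: made-up embeddings standing in for CLIP text features.
vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "umbrella": [0.0, 0.0, 1.0]}
region = [0.9, 0.1, 0.0]
print(score_classes(region, vocab)[0][0])  # cat

# Extending the vocabulary needs no retraining, only a new text embedding.
vocab["kitten"] = [1.0, 0.2, 0.0]
print(score_classes(region, vocab)[0][0])  # kitten
```

Adding `kitten` changes the top label for the same region, which illustrates why the class names and wording a user chooses can shift the model output.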


The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and the TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license.

Ethical Considerations

Detic’s wide range of detection capabilities may introduce challenges similar to those of many other visual recognition and open-set recognition methods. Because users can define arbitrary detection classes, the choice of classes and their semantics may affect the model output.


If you find this project useful for your research, please use the following BibTeX entry.

  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},