adirik/owlvit-base-patch32 | Run with an API on Replicate

Input

Run this model in Node.js with one line of code:

npx create-replicate --model=adirik/owlvit-base-patch32

Or set up a project from scratch.

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run adirik/owlvit-base-patch32 using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "adirik/owlvit-base-patch32:5e899f155a1913c4b7304d09082d842ca7fe6cb1f22e066c83eb1d7849dc37c2",
  {
    input: {
      image: "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
      query: "human face, rocket, star-spangled banner, nasa badge",
      threshold: 0.11,
      show_visualisation: true
    }
  }
);

console.log(output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run adirik/owlvit-base-patch32 using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "adirik/owlvit-base-patch32:5e899f155a1913c4b7304d09082d842ca7fe6cb1f22e066c83eb1d7849dc37c2",
    input={
        "image": "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
        "query": "human face, rocket, star-spangled banner, nasa badge",
        "threshold": 0.11,
        "show_visualisation": True
    }
)
print(output)

To learn more, take a look at the guide on getting started with Python.

To call the HTTP API directly with cURL, set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run adirik/owlvit-base-patch32 using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "adirik/owlvit-base-patch32:5e899f155a1913c4b7304d09082d842ca7fe6cb1f22e066c83eb1d7849dc37c2",
    "input": {
      "image": "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
      "query": "human face, rocket, star-spangled banner, nasa badge",
      "threshold": 0.11,
      "show_visualisation": true
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.
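The same request can also be assembled with Python's standard library instead of cURL. The following is a minimal sketch, not an official client: the helper name build_prediction_request is hypothetical, and the request is only constructed here, not sent, so the snippet runs without network access.

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"
MODEL_VERSION = (
    "adirik/owlvit-base-patch32:"
    "5e899f155a1913c4b7304d09082d842ca7fe6cb1f22e066c83eb1d7849dc37c2"
)


def build_prediction_request(token, image_url, query, threshold):
    """Hypothetical helper: mirror the cURL call above as a urllib Request."""
    payload = {
        "version": MODEL_VERSION,
        "input": {
            "image": image_url,
            "query": query,
            "threshold": threshold,
            "show_visualisation": True,
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Prefer": "wait",  # same header as the cURL example
        },
        method="POST",
    )


request = build_prediction_request(
    os.environ.get("REPLICATE_API_TOKEN", "<paste-your-token-here>"),
    "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
    "human face, rocket, star-spangled banner, nasa badge",
    0.11,
)

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```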

Output

The model returns two outputs: json_data, which lists the detected objects with their bounding boxes, labels, and confidence scores, and result_image, a URL to the visualisation image. Both appear under "output" in the full prediction for the example input:

{
  "completed_at": "2023-10-15T17:54:22.524691Z",
  "created_at": "2023-10-15T17:50:51.320269Z",
  "data_removed": false,
  "error": null,
  "id": "2viboyrc3hn5eelepidmqfzm6i",
  "input": {
    "image": "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
    "query": "human face, rocket, star-spangled banner, nasa badge",
    "threshold": 0.11,
    "show_visualisation": true
  },
  "logs": "human face, rocket, star-spangled banner, nasa badge True\n/root/.pyenv/versions/3.9.18/lib/python3.9/site-packages/transformers/models/owlvit/image_processing_owlvit.py:429: FutureWarning: `post_process` is deprecated and will be removed in v5 of Transformers, please use `post_process_object_detection` instead, with `threshold=0.` for equivalent results.\nwarnings.warn(",
  "metrics": {
    "predict_time": 5.496842,
    "total_time": 211.204422
  },
  "output": {
    "json_data": {
      "objects": [
        {
          "bbox": [
            180,
            71,
            271,
            178
          ],
          "label": "human face",
          "confidence": 0.35713595151901245
        },
        {
          "bbox": [
            1,
            1,
            105,
            509
          ],
          "label": "star-spangled banner",
          "confidence": 0.13790424168109894
        },
        {
          "bbox": [
            350,
            -1,
            468,
            288
          ],
          "label": "rocket",
          "confidence": 0.2110234647989273
        },
        {
          "bbox": [
            129,
            348,
            206,
            427
          ],
          "label": "nasa badge",
          "confidence": 0.28099769353866577
        },
        {
          "bbox": [
            277,
            338,
            327,
            380
          ],
          "label": "nasa badge",
          "confidence": 0.1195005401968956
        }
      ]
    },
    "result_image": "https://replicate.delivery/pbxt/oO5rHoHwsrYGJh5HeElqpBBmjoi1gkXxGofpiQuxMvDNlduRA/result.png"
  },
  "started_at": "2023-10-15T17:54:17.027849Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/2viboyrc3hn5eelepidmqfzm6i",
    "cancel": "https://api.replicate.com/v1/predictions/2viboyrc3hn5eelepidmqfzm6i/cancel"
  },
  "version": "5e899f155a1913c4b7304d09082d842ca7fe6cb1f22e066c83eb1d7849dc37c2"
}
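Once a prediction has succeeded, the json_data output can be post-processed client-side. The sketch below keeps the highest-confidence detection for each label and computes box dimensions. It assumes that bbox is ordered as [x_min, y_min, x_max, y_max] in pixels, which is consistent with the values above but not stated on this page; the helper names best_per_label and box_size are hypothetical.

```python
# The "json_data" value copied from the example prediction above.
json_data = {
    "objects": [
        {"bbox": [180, 71, 271, 178], "label": "human face", "confidence": 0.35713595151901245},
        {"bbox": [1, 1, 105, 509], "label": "star-spangled banner", "confidence": 0.13790424168109894},
        {"bbox": [350, -1, 468, 288], "label": "rocket", "confidence": 0.2110234647989273},
        {"bbox": [129, 348, 206, 427], "label": "nasa badge", "confidence": 0.28099769353866577},
        {"bbox": [277, 338, 327, 380], "label": "nasa badge", "confidence": 0.1195005401968956},
    ]
}


def best_per_label(objects, min_confidence=0.0):
    """Keep only the highest-confidence detection for each label."""
    best = {}
    for obj in objects:
        if obj["confidence"] < min_confidence:
            continue
        current = best.get(obj["label"])
        if current is None or obj["confidence"] > current["confidence"]:
            best[obj["label"]] = obj
    return best


def box_size(bbox):
    """Return (width, height), assuming [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = bbox
    return x_max - x_min, y_max - y_min


best = best_per_label(json_data["objects"])
for label, obj in best.items():
    width, height = box_size(obj["bbox"])
    print(f"{label}: {obj['confidence']:.2f} ({width}x{height} px)")
```

Coordinates can fall slightly outside the image, as with the -1 in the rocket box, so clamp them before cropping.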

Generated in 5.5 seconds

Run time and cost

This model costs approximately $0.020 to run on Replicate, or 50 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
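As a rough budgeting aid, the quoted per-run figure can be turned into a small calculator. This is a sketch that treats the approximate $0.020 as a fixed price, which the text above says it is not; the function names are hypothetical.

```python
from decimal import Decimal

# Approximate per-run cost quoted above; actual cost varies with inputs.
COST_PER_RUN = Decimal("0.020")


def estimate_cost(runs):
    """Estimated cost in dollars for a number of runs."""
    return COST_PER_RUN * runs


def runs_per_budget(budget):
    """Whole number of runs that fit within a dollar budget."""
    return int(Decimal(budget) / COST_PER_RUN)


print(runs_per_budget("1"))   # runs per $1
print(estimate_cost(1000))    # dollars for 1,000 runs
```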

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 87 seconds. The predict time for this model varies significantly based on the inputs.
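The example prediction above illustrates how much run time can differ from wall-clock time: it reports a predict_time of about 5.5 seconds but a total_time of about 211 seconds. The sketch below recovers both figures from the prediction's timestamps. Reading the gap between created_at and started_at as time spent waiting for the model to start is an assumption, not something this page states.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    # Older Python versions reject "Z" in fromisoformat, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Timestamps copied from the example prediction above.
created = parse_timestamp("2023-10-15T17:50:51.320269Z")
started = parse_timestamp("2023-10-15T17:54:17.027849Z")
completed = parse_timestamp("2023-10-15T17:54:22.524691Z")

wait_seconds = (started - created).total_seconds()
run_seconds = (completed - started).total_seconds()

print(f"waiting: {wait_seconds:.1f} s, running: {run_seconds:.1f} s")
```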

Readme

OWL-ViT

OWL-ViT uses CLIP and vision transformer backbones to enable open-vocabulary object detection. See the paper, the original repository, and the Hugging Face implementation for details.

Using the API

You can use OWL-ViT to query images with text descriptions of any object. To use it, upload an image and enter comma-separated text descriptions of the objects you want to find (the query input in the API). You can also use the score threshold slider (the threshold input) to filter out low-confidence predictions.
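When calling the API from code, the comma-separated query can be built from a list of object descriptions. The following is a minimal sketch with a hypothetical helper, build_input. The field names match the example input on this page; the check that threshold lies between 0 and 1 is an assumption about the valid range.

```python
def build_input(image_url, labels, threshold, show_visualisation=True):
    """Assemble the model input from a list of object descriptions."""
    cleaned = [label.strip() for label in labels if label.strip()]
    if not cleaned:
        raise ValueError("at least one object description is required")
    # Assumption: confidence scores, and so thresholds, lie in [0, 1].
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return {
        "image": image_url,
        "query": ", ".join(cleaned),
        "threshold": threshold,
        "show_visualisation": show_visualisation,
    }


model_input = build_input(
    "https://replicate.delivery/pbxt/JhlycB8ScNVrMu0ke1Xlg09ajbsmMfp4TK19JXpnYq6GrHK8/astronaut.png",
    ["human face", "rocket", "star-spangled banner", "nasa badge"],
    threshold=0.11,
)
print(model_input["query"])
```

The resulting dictionary can be passed as the input argument of replicate.run.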

OWL-ViT is trained on text templates, so you may get better predictions by phrasing queries with the templates used to train the original model, for example “photo of a star-spangled banner” or “image of a shoe”. Refer to the CLIP paper for the full list of text templates used to augment the training data.
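Wrapping plain object names in a template is easy to automate. The sketch below uses a hypothetical helper, templated_query, with the “photo of a {}” template from the example above. It also returns a mapping back to the plain names, on the assumption that the model echoes the query text in each detection's label, as it does in the example output on this page.

```python
def templated_query(labels, template="photo of a {}"):
    """Wrap each label in a text template and build the query string."""
    # Maps each templated phrase back to the plain label it came from.
    mapping = {template.format(label): label for label in labels}
    return ", ".join(mapping), mapping


query, mapping = templated_query(["star-spangled banner", "shoe"])
print(query)

# Map a returned label back to the plain object name.
print(mapping["photo of a shoe"])
```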

References

@article{minderer2022simple,
  title={Simple Open-Vocabulary Object Detection with Vision Transformers},
  author={Matthias Minderer and Alexey Gritsenko and Austin Stone and Maxim Neumann and Dirk Weissenborn and Alexey Dosovitskiy and Aravindh Mahendran and Anurag Arnab and Mostafa Dehghani and Zhuoran Shen and Xiao Wang and Xiaohua Zhai and Thomas Kipf and Neil Houlsby},
  journal={ECCV},
  year={2022},
}