Detect objects

These models detect objects in images and videos. You can use them to find what’s in a scene, identify each object, and locate where it is. You can also cut objects out of a scene, or create masks for inpainting and other tasks.

Best for detecting objects in images: adirik/grounding-dino

To find specific things in an image, we recommend adirik/grounding-dino. You can input any number of text labels and get back bounding boxes for each of the objects you’re looking for. It’s cheap and takes less than a second to run.
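A minimal sketch of calling this model with the Replicate Python client. The input keys here (`image`, `query`) and the comma-separated label format are assumptions for illustration; check the model’s schema on its Replicate page before relying on them.

```python
def build_input(image_url, labels):
    """Build the input payload: labels become one comma-separated query string."""
    return {"image": image_url, "query": ", ".join(labels)}

def detect_objects(image_url, labels):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    return replicate.run("adirik/grounding-dino", input=build_input(image_url, labels))
```

The output contains one bounding box per detected object, so a call like `detect_objects(url, ["cat", "dog"])` asks for boxes around every cat and dog in the image.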

Best for detecting objects in videos: zsxkib/yolo-world

Use this model to find and track things in videos from text labels. You’ll get back bounding boxes for each object in each frame.

You can also use zsxkib/yolo-world for images. It’s similar in performance to adirik/grounding-dino, but sometimes one or the other works better for a given use case.
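A sketch of running this model on a video with the Replicate Python client. The input keys (`input_media`, `class_names`) are assumptions based on typical schemas for this model; verify them on the model page.

```python
def build_input(video_url, labels):
    """Class names are joined into one comma-separated string."""
    return {"input_media": video_url, "class_names": ", ".join(labels)}

def track_objects(video_url, labels):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    return replicate.run("zsxkib/yolo-world", input=build_input(video_url, labels))
```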

Best for segmentation: meta/sam-2 and meta/sam-2-video

Meta’s Segment Anything models are a great way to extract things from images and videos, or to create masks for inpainting. They require a little more preparation than the bounding-box models: you’ll need to send the coordinates of click points for the objects you want to segment.
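A sketch of prompting SAM 2 with click points via the Replicate Python client. The key names (`image`, `click_coordinates`) and the `"x,y"` string encoding are assumptions for illustration; the real input schema is on the model page.

```python
def build_input(image_url, points):
    """Encode (x, y) click points as a semicolon-separated coordinate string."""
    coords = ";".join(f"{x},{y}" for x, y in points)
    return {"image": image_url, "click_coordinates": coords}

def segment(image_url, points):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    return replicate.run("meta/sam-2", input=build_input(image_url, points))
```

Each click point lands on an object you want a mask for, e.g. `segment(url, [(120, 80)])` for a single object.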

If you want to segment objects with text labels, try schananas/grounded_sam. Send a text prompt with object names and you’ll get back a mask for the collection of objects you’ve described.
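A sketch of text-prompted segmentation with this model. The input keys (`image`, `mask_prompt`, `negative_mask_prompt`) are assumptions; check the model’s Replicate page for its actual schema.

```python
def build_input(image_url, prompt, negative_prompt=""):
    """Describe what to segment; the negative prompt excludes unwanted regions."""
    return {
        "image": image_url,
        "mask_prompt": prompt,
        "negative_mask_prompt": negative_prompt,
    }

def text_segment(image_url, prompt):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    return replicate.run("schananas/grounded_sam", input=build_input(image_url, prompt))
```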

Best for tracking objects in videos: zsxkib/samurai

Input a video and the coordinates for an object, and this specialized version of SAM will track the object across frames.
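A sketch of pointing this model at one object. The key names (`video`, `x_coordinate`, `y_coordinate`) are hypothetical placeholders; the model’s actual inputs are listed on its Replicate page.

```python
def build_input(video_url, x, y):
    """Point at the object in the first frame; the model tracks it from there."""
    return {"video": video_url, "x_coordinate": x, "y_coordinate": y}

def track(video_url, x, y):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    return replicate.run("zsxkib/samurai", input=build_input(video_url, x, y))
```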

Best for labeling whole scenes: cjwbw/semantic-segment-anything

This model will label every pixel in an image with a class. It’s great for generating training data and for creating masks for inpainting.
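A sketch of running whole-scene labeling and saving the result to disk. This assumes the model’s output is a URL (or list of URLs) to the labeled image, which is common for Replicate models but not guaranteed; the output format is documented on the model page.

```python
import urllib.request

def pick_url(output):
    """Replicate models may return a single URL or a list of URLs."""
    return output if isinstance(output, str) else output[0]

def label_scene(image_url, out_path="labels.png"):
    # pip install replicate; requires the REPLICATE_API_TOKEN env var.
    import replicate
    output = replicate.run("cjwbw/semantic-segment-anything", input={"image": image_url})
    urllib.request.urlretrieve(pick_url(output), out_path)
    return out_path
```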
