These models identify objects in images and videos: what's in a scene, what each object is, and where it's located. You can also cut objects out of a scene, or create masks for inpainting and other tasks.
To find specific things in an image, we recommend adirik/grounding-dino. You can input any number of text labels and get back bounding boxes for each of the objects you're looking for. It's cheap and takes less than a second to run.
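A call like this can be sketched with the Replicate Python client. The input field names (`image`, `query`) and the comma-separated label format are assumptions based on typical open-vocabulary detectors; check the model's API page for the exact schema before running.

```python
# Sketch: open-vocabulary detection with adirik/grounding-dino on Replicate.
# The input field names ("image", "query") and the comma-separated label
# format are assumptions -- confirm them on the model's API page.

def build_input(image_url: str, labels: list[str]) -> dict:
    """Assemble the input payload: one image plus comma-separated text labels."""
    return {"image": image_url, "query": ", ".join(labels)}

def detect(image_url: str, labels: list[str]):
    """Run the model; requires the `replicate` client and an API token."""
    import replicate  # deferred so build_input works without the client installed
    return replicate.run("adirik/grounding-dino", input=build_input(image_url, labels))
```

Each detection in the output typically includes a label, a confidence score, and box coordinates, though the exact shape depends on the model version.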
Use this model to find and track objects in videos from text labels. You'll get back bounding boxes for each object, frame by frame.
You can also use zsxkib/yolo-world for images. It performs similarly to adirik/grounding-dino, but one or the other may work better for a given use case.
Meta's Segment Anything models (meta/sam-2 for images, meta/sam-2-video for videos) are a great way to extract objects from images and videos, or to create masks for inpainting. They require a little more preparation than the bounding box models: you'll need to send the coordinates of click points for the objects you want to segment.
If you want to segment objects with text labels, try schananas/grounded_sam. Send a text prompt with object names and you'll get back a mask for the collection of objects you've described.
Input a video and the coordinates for an object, and this specialized version of SAM will track the object across frames.
This model labels every pixel in an image with a class. It's great for generating training data and for creating masks for inpainting.
Featured models


zsxkib/samurai
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Updated 11 months, 2 weeks ago
229 runs


meta/sam-2-video
SAM 2: Segment Anything v2 (for videos)
Updated 1 year, 3 months ago
48.4K runs


meta/sam-2
SAM 2: Segment Anything v2 (for Images)
Updated 1 year, 3 months ago
29.4K runs


zsxkib/yolo-world
Real-Time Open-Vocabulary Object Detection
Updated 1 year, 9 months ago
12.4K runs


schananas/grounded_sam
Mask prompting based on Grounding DINO & Segment Anything | Integral cog of doiwear.it
Updated 2 years ago
865.5K runs


adirik/grounding-dino
Detect everything with language!
Updated 2 years ago
20.6M runs


cjwbw/semantic-segment-anything
Adding semantic labels for segment anything
Updated 2 years, 7 months ago
36.2K runs
Recommended Models
If you need low-latency detection, adirik/grounding-dino is one of the fastest models in the object detection & segmentation collection. It’s designed for quick, open-vocabulary detection — you can pass in text labels like “dog,” “bicycle,” or “traffic light,” and it returns bounding boxes in roughly a second for most images.
Fast models work well for simple scenes, but they may be less precise in crowded or complex images.
For advanced use cases that require more control or detail, meta/sam-2 (for images) and meta/sam-2-video (for videos) are strong choices.
When your task is to detect particular objects from text labels — for example, “find the person and the umbrella” — adirik/grounding-dino is built exactly for that. It uses open-vocabulary detection, meaning you can describe any object with text, not just a fixed list of categories.
It’s particularly good for images with clear subjects and minimal occlusion.
If you need to follow objects over time (such as people, vehicles, or sports equipment), zsxkib/samurai or meta/sam-2-video are well suited.
These models provide object detection and tracking across multiple frames, maintaining consistent IDs or masks as objects move.
Depending on the model, you may get bounding boxes, segmentation masks, per-pixel class labels, or frame-by-frame tracking data.
Some segmentation models, like meta/sam-2, may require you to provide click points or coordinates to specify which regions to segment.
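Preparing click-point prompts can be sketched like this. The field names (`image`, `click_coordinates`, `click_labels`) and the coordinate encoding are hypothetical; SAM variants on Replicate encode points differently, so check the model's API page first.

```python
# Sketch: preparing click-point prompts for a SAM-style model such as meta/sam-2.
# The field names ("image", "click_coordinates", "click_labels") and the
# string encoding of points are hypothetical -- SAM variants on Replicate
# accept points in different formats, so check the model's API page.

def build_click_input(image_url: str, points: list[tuple[int, int]], labels: list[int]) -> dict:
    """Encode (x, y) click points; labels mark foreground (1) vs background (0) clicks."""
    if len(points) != len(labels):
        raise ValueError("each click point needs a matching foreground/background label")
    coords = ",".join(f"[{x},{y}]" for x, y in points)
    return {
        "image": image_url,
        "click_coordinates": coords,
        "click_labels": ",".join(map(str, labels)),
    }
```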
You can package your own model (for example, a fine-tuned version of YOLO or SAM) with Cog and push it to Replicate. This allows you to define your own input structure, such as image or video plus text prompts, and control how it’s shared and used.
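A minimal `predict.py` for such a package might look like the sketch below, using Cog's `BasePredictor` and `Input` classes. The `try`/`except` stubs only exist so the sketch can be read outside a Cog build, and the returned output is a placeholder standing in for your own model's inference code.

```python
# Sketch: a minimal Cog predictor for a custom detection model.
# `cog` is only available inside a Cog environment, so we stub its names
# here; the predict body is a placeholder for your own inference code.
try:
    from cog import BasePredictor, Input, Path
except ImportError:  # reading the sketch outside a Cog environment
    BasePredictor, Path = object, str
    def Input(**kwargs):
        return kwargs.get("default")

class Predictor(BasePredictor):
    def setup(self):
        """Load model weights once, when the container starts."""
        self.model = None  # e.g. load your fine-tuned YOLO or SAM checkpoint

    def predict(
        self,
        image: Path = Input(description="Input image"),
        prompt: str = Input(description="Comma-separated object labels", default=""),
    ) -> dict:
        """Return detections for the requested labels."""
        labels = [p.strip() for p in prompt.split(",") if p.strip()]
        return {"labels": labels, "detections": []}  # placeholder output
```

With a matching `cog.yaml`, `cog push` uploads this as a model on Replicate with the input structure you defined here.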
Many models in the object detection & segmentation collection allow commercial use, but license terms vary. Check each model’s license page for attribution or usage restrictions before deploying in production or commercial environments.
Recommended Models


jweek/mask_maker
Uses DINO to detect regions and further refines them with SAM. Returns masking data as RLE encoded JSON.
Updated 4 months, 3 weeks ago
508 runs


lucataco/florence-2-large
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Updated 1 year, 4 months ago
513.9K runs


ahmdyassr/mask-clothing
Super fast clothing (and face) segmentation and masking with erosion and dilation capability, made for https://outfit.fm
Updated 1 year, 5 months ago
39.7K runs


hadilq/hair-segment
This is an ML model to segment hairs in pictures.
Updated 1 year, 5 months ago
516 runs


swook/inspyrenet
Segment foreground objects with high resolution and matting, using InSPyReNet
Updated 1 year, 5 months ago
695.2K runs


falcons-ai/nsfw_image_detection
Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification
Updated 1 year, 11 months ago
65.8M runs


chigozienri/mediapipe-face
batch or individual face detection with mediapipe
Updated 2 years ago
95.2K runs


adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
Updated 2 years, 1 month ago
24.5K runs


hassamdevsy/mask2former
Facebook Mask2Former trained on ADE 20k Dataset
Updated 2 years, 4 months ago
58.2K runs


idea-research/ram-grounded-sam
A Strong Image Tagging Model with Segment Anything
Updated 2 years, 4 months ago
1.5M runs


naklecha/clothing-segmentation
This model can detect clothing using a custom state of the art clothing segmentation algorithm.
Updated 2 years, 5 months ago
4.1K runs


daanelson/yolox
High performance and lightweight object detection models
Updated 2 years, 9 months ago
102.3K runs