Detect objects
These models detect objects in images and videos: what's in a scene, what each object is, and where it's located. You can also cut objects out of a scene, or create masks for inpainting and other tasks.
Best for detecting objects in images: adirik/grounding-dino
To find specific things in an image, we recommend adirik/grounding-dino. You can input any number of text labels and get back bounding boxes for each of the objects you’re looking for. It’s cheap and takes less than a second to run.
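As a sketch of the round trip — calling the model with the replicate client and filtering the boxes that come back. The input field `query` (a comma-separated label string) and the output shape with `detections`, `label`, and `bbox` fields are assumptions here; check the model page for the exact schema:

```python
def pick_boxes(output: dict, wanted: set) -> list:
    """Keep only the detections whose label is in `wanted`.
    Assumes output like {"detections": [{"label": str, "bbox": [x0, y0, x1, y1]}]}
    (an illustrative shape, not the model's guaranteed schema)."""
    return [
        (d["label"], d["bbox"])
        for d in output.get("detections", [])
        if d["label"] in wanted
    ]

if __name__ == "__main__":
    import replicate  # pip install replicate; set REPLICATE_API_TOKEN

    output = replicate.run(
        "adirik/grounding-dino",
        input={
            "image": open("scene.jpg", "rb"),
            "query": "dog, bicycle, traffic light",  # labels are illustrative
        },
    )
    print(pick_boxes(output, {"dog"}))
```

The helper is kept separate from the network call so you can reuse it on cached outputs.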
Best for detecting objects in videos: zsxkib/yolo-world
Use this model to find and track things in videos from text labels. You’ll get back bounding boxes for each object by frame.
You can also use zsxkib/yolo-world for images. It's similar in performance to grounding-dino, but sometimes one or the other will work better for a given use case.
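If you just need to know when an object appears, the per-frame boxes fold into a timeline per label. A minimal sketch, assuming the output is a list of frames each carrying labeled detections (the field names are illustrative, not the model's actual schema):

```python
from collections import defaultdict

def frames_containing(frames: list) -> dict:
    """Map each label to the sorted frame indices where it was detected.
    Assumes frames like [{"detections": [{"label": str, "bbox": [...]}]}, ...]
    (an illustrative shape)."""
    seen = defaultdict(set)
    for i, frame in enumerate(frames):
        for det in frame.get("detections", []):
            seen[det["label"]].add(i)
    return {label: sorted(indices) for label, indices in seen.items()}
```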
Best for segmentation: meta/sam-2 and meta/sam-2-video
Meta’s Segment Anything models are a great way to extract things from images and videos, or to create masks for inpainting. They require a little more preparation than the bounding-box models: you’ll need to send the coordinates of click points for the objects you want to segment.
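Click prompts are just (x, y) coordinates plus an include/exclude flag per point. A sketch of packaging them, assuming the SAM convention where label 1 means "part of the object" and 0 means "background" — the exact input field names vary between model versions, so check the model page:

```python
def click_prompt(include: list, exclude: list = ()) -> dict:
    """Build a SAM-style point prompt from pixel coordinates.
    `include` and `exclude` are lists of (x, y) tuples; labels 1/0 mark
    foreground and background clicks. The field names "point_coords" and
    "point_labels" follow SAM's convention but may differ per model."""
    points = list(include) + list(exclude)
    labels = [1] * len(include) + [0] * len(exclude)
    return {"point_coords": [list(p) for p in points], "point_labels": labels}
```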
If you want to segment objects with text labels, try schananas/grounded_sam. Send a text prompt with object names and you’ll get back a mask for the collection of objects you’ve described.
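Once you have a mask back, cutting the described objects out (or preparing an inpainting input) is elementwise work. A toy sketch on nested lists standing in for grayscale pixels — real code would use numpy or PIL on the returned mask image:

```python
def cut_out(pixels, mask, background=0):
    """Keep pixels where the binary mask is 1; everything else becomes
    `background`. `pixels` and `mask` are equal-sized 2D lists."""
    return [
        [px if m else background for px, m in zip(prow, mrow)]
        for prow, mrow in zip(pixels, mask)
    ]
```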
Best for tracking objects in videos: zsxkib/samurai
Input a video and the coordinates for an object, and this specialized version of SAM will track the object across frames.
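A common downstream step is turning the tracked per-frame boxes into a trajectory of the object's center. A sketch assuming `[x0, y0, x1, y1]` boxes, one per frame, with `None` standing in for frames where the tracker lost the object:

```python
def trajectory(boxes):
    """Box centers per frame; None is carried through for lost frames."""
    centers = []
    for box in boxes:
        if box is None:
            centers.append(None)
        else:
            x0, y0, x1, y1 = box
            centers.append(((x0 + x1) / 2, (y0 + y1) / 2))
    return centers
```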
Best for labeling whole scenes: cjwbw/semantic-segment-anything
This model labels every pixel in an image with a class. It’s great for creating training data and masks for inpainting.
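A per-pixel label map converts directly into an inpainting mask for any one class. A toy sketch on nested lists (a real label map would be a numpy array or an indexed PNG):

```python
def class_mask(label_map, target):
    """Binary mask: 1 where the pixel's class equals `target`, else 0."""
    return [[1 if lbl == target else 0 for lbl in row] for row in label_map]
```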
Featured models
zsxkib / samurai
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
meta / sam-2-video
SAM 2: Segment Anything v2 (for videos)
meta / sam-2
SAM 2: Segment Anything v2 (for images)
zsxkib / yolo-world
Real-Time Open-Vocabulary Object Detection
schananas / grounded_sam
Mask prompting based on Grounding DINO & Segment Anything | Integral cog of doiwear.it
adirik / grounding-dino
Detect everything with language!
cjwbw / semantic-segment-anything
Adding semantic labels for segment anything
Recommended models
jweek / mask_maker
Uses DINO to detect regions and further refines them with SAM. Returns masking data as RLE encoded JSON.
lucataco / florence-2-large
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
ahmdyassr / mask-clothing
Super fast clothing (and face) segmentation and masking with erosion and dilation capability.
hadilq / hair-segment
This is an ML model to segment hairs in pictures.
swook / inspyrenet
Segment foreground objects with high resolution and matting, using InSPyReNet
falcons-ai / nsfw_image_detection
Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification
chigozienri / mediapipe-face
batch or individual face detection with mediapipe
adirik / owlvit-base-patch32
Zero-shot / open vocabulary object detection
hassamdevsy / mask2former
Facebook Mask2Former trained on ADE 20k Dataset
idea-research / ram-grounded-sam
A Strong Image Tagging Model with Segment Anything
naklecha / clothing-segmentation
This model can detect clothing using a custom state of the art clothing segmentation algorithm.
daanelson / yolox
High performance and lightweight object detection models