idea-research / ram-grounded-sam

A Strong Image Tagging Model with Segment Anything

  • Public
  • 1.5M runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

input_image
file (required)

Input image

boolean

Use sam_hq instead of SAM for prediction

Default: false

boolean

Output bounding boxes and masks on the image

Default: false
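A minimal sketch of calling this model through the Replicate Python client. "input_image" is the documented input; the names of the two boolean flags are not shown on this page, so `use_sam_hq` and `show_visualisation` below are assumptions — check the model's API schema for the real names.

```python
# Hypothetical helper for assembling the model's input. The boolean field
# names ("use_sam_hq", "show_visualisation") are assumed, not confirmed by
# this page; "input_image" is documented above.
def build_input(image_path, use_sam_hq=False, show_visualisation=False):
    """Assemble the input dict expected by ram-grounded-sam."""
    return {
        "input_image": open(image_path, "rb"),
        "use_sam_hq": use_sam_hq,
        "show_visualisation": show_visualisation,
    }

# Example (requires the replicate package and REPLICATE_API_TOKEN):
# import replicate
# output = replicate.run(
#     "idea-research/ram-grounded-sam:<version>",
#     input=build_input("living_room.jpg"),
# )
# print(output["tags"])
```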

Output

tags

armchair, blanket, lamp, carpet, couch, dog, floor, furniture, gray, green, living room, picture frame, pillow, plant, room, sit, stool, wood floor

json_data

{
  "mask": [
    { "label": "background", "value": 0 },
    { "box": [1590.994384765625, 1150.455810546875, 2062.0126953125, 1525.410400390625], "label": "dog", "logit": 0.61, "value": 1 },
    { "box": [320.5067138671875, 1872.4437255859375, 2564.30126953125, 2149.60693359375], "label": "carpet", "logit": 0.51, "value": 2 },
    { "box": [2526.239013671875, 905.7824096679688, 2951.733154296875, 1446.57421875], "label": "lamp", "logit": 0.49, "value": 3 },
    { "box": [1488.95751953125, 1134.001953125, 2101.83740234375, 1869.38525390625], "label": "blanket", "logit": 0.49, "value": 4 },
    { "box": [1649.08935546875, 462.552734375, 2173.29248046875, 844.016357421875], "label": "picture frame", "logit": 0.46, "value": 5 },
    { "box": [7.120361328125, 9.4737548828125, 2988.807861328125, 2143.8388671875], "label": "living room room", "logit": 0.45, "value": 6 },
    { "box": [5.122314453125, 1560.66943359375, 2993.923828125, 2147.88232421875], "label": "floor wood floor", "logit": 0.44, "value": 7 },
    { "box": [21.509536743164062, 1283.55908203125, 493.4183349609375, 1861.4951171875], "label": "plant", "logit": 0.43, "value": 8 },
    { "box": [318.203369140625, 746.8875732421875, 874.658447265625, 1167.321044921875], "label": "plant", "logit": 0.4, "value": 9 },
    { "box": [2223.62109375, 196.16336059570312, 2685.33837890625, 838.8326416015625], "label": "picture frame", "logit": 0.38, "value": 10 },
    { "box": [729.8843383789062, 1130.2867431640625, 2381.857666015625, 1854.0660400390625], "label": "armchair couch", "logit": 0.38, "value": 11 },
    { "box": [987.6152954101562, 1520.15966796875, 1631.835693359375, 2110.7431640625], "label": "furniture stool", "logit": 0.37, "value": 12 },
    { "box": [1056.6025390625, 1164.431396484375, 1398.310791015625, 1448.27099609375], "label": "pillow", "logit": 0.35, "value": 13 },
    { "box": [2108.94873046875, 1920.5863037109375, 2995.509033203125, 2149.590087890625], "label": "stool", "logit": 0.31, "value": 14 },
    { "box": [1787.185302734375, 1134.058837890625, 2167.47900390625, 1448.926025390625], "label": "pillow", "logit": 0.31, "value": 15 },
    { "box": [899.2510986328125, 1112.372802734375, 1249.1456298828125, 1453.3167724609375], "label": "pillow", "logit": 0.3, "value": 16 },
    { "box": [471.819580078125, 1143.072998046875, 961.84228515625, 1664.814697265625], "label": "furniture", "logit": 0.27, "value": 17 },
    { "box": [2467.264892578125, 1397.0821533203125, 2937.561279296875, 1785.8111572265625], "label": "furniture", "logit": 0.26, "value": 18 }
  ],
  "tags": "armchair, blanket, lamp, carpet, couch, dog, floor, furniture, gray, green, living room, picture frame, pillow, plant, room, sit, stool, wood floor"
}
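One way to post-process the json_data output above: keep only detections whose grounding confidence ("logit") clears a threshold, and derive pixel areas from the [x1, y1, x2, y2] boxes. The field names match the example output; the threshold value is an arbitrary choice for illustration.

```python
# Filter the "mask" entries of a json_data result by confidence and compute
# box areas. The background entry (value 0) carries no box, so it is skipped.
def confident_detections(json_data, min_logit=0.4):
    results = []
    for m in json_data["mask"]:
        if "box" not in m:
            continue  # e.g. the background entry
        if m.get("logit", 0.0) >= min_logit:
            x1, y1, x2, y2 = m["box"]
            results.append({
                "label": m["label"],
                "logit": m["logit"],
                "area": (x2 - x1) * (y2 - y1),  # pixel area of the box
            })
    return results
```

Applied to the example above with the default threshold, this keeps the ten detections from "dog" (0.61) down to "plant" (0.4) and drops the lower-confidence furniture and pillow boxes.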
This example was created by a different version, idea-research/ram-grounded-sam:47c4f1c7.

Run time and cost

This model costs approximately $0.075 to run on Replicate, or 13 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 54 seconds, though predict time varies significantly with the inputs.
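The quoted figures are roughly self-consistent, as a quick sanity check shows. The per-second A100 (80GB) price below is an assumption (Replicate pricing changes over time); the point is only that ~54 s at roughly $0.0014/s lands near the quoted ~$0.075 per run.

```python
# Sanity-check the quoted cost against the quoted runtime.
price_per_second = 0.0014   # assumed USD/s for A100 80GB (not from this page)
typical_runtime = 54        # seconds, from the figure above
cost_per_run = price_per_second * typical_runtime
runs_per_dollar = 1 / cost_per_run
print(f"~${cost_per_run:.3f} per run, ~{runs_per_dollar:.0f} runs per $1")
```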

Readme

Recognize Anything with Grounded-Segment-Anything

Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.

Highlights of RAM

  • Strong and general. RAM exhibits exceptional image tagging capability with powerful zero-shot generalization: it significantly outperforms CLIP and BLIP, surpasses fully supervised approaches (ML-Decoder), and is competitive with the Google tagging API.
  • Reproducible and affordable. RAM has a low reproduction cost thanks to its open-source, annotation-free dataset.
  • Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

RAM significantly improves tagging ability over the Tag2Text framework:

  • Accuracy. RAM uses a data engine to generate additional annotations and clean incorrect ones, achieving higher accuracy than Tag2Text.
  • Scope. RAM expands the number of fixed tags from 3,400+ to 6,400+ (merged by synonym reduction to 4,500+ distinct semantic tags), covering more valuable categories.

Citation

If you find our work useful for your research, please consider citing:

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{liu2023grounding,
  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
  journal={arXiv preprint arXiv:2303.05499},
  year={2023}
}