arielreplicate / gscorecam-clip-analyzer

Shows what CLIP looks at in an image given text

Run time and cost

This model costs approximately $0.0054 to run on Replicate, or 185 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 24 seconds. The predict time for this model varies significantly based on the inputs.

Readme

gScoreCAM: What is CLIP looking at?

Shows which parts of an image are most correlated with a given text, as judged in CLIP's embedding space.

tldr: CLIP ResNet-50 channels are very noisy compared to those of a typical ImageNet-trained ResNet-50, and most saliency methods obtain fairly low object localization scores with CLIP. Based on these observations, our gScoreCAM visualizes only the top 10% most sensitive (highest-gradient) channels and obtains state-of-the-art weakly supervised localization results using CLIP (in both ResNet and ViT versions).
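The channel-selection idea can be illustrated with a minimal NumPy sketch. This is not the official implementation: the function names `top_gradient_channels` and `combine_channels` are hypothetical, ranking channels by mean absolute gradient is an assumption about what "most sensitive" means, and the per-channel `weights` are simply passed in rather than computed by a ScoreCAM-style scoring step.

```python
import numpy as np


def top_gradient_channels(gradients, fraction=0.1):
    """Pick the channels with the largest mean absolute gradient.

    gradients: array of shape (C, H, W), the gradient of the image-text
    similarity with respect to one layer's activations (assumed given).
    Returns the indices of the top `fraction` of channels, most sensitive first.
    """
    num_channels = gradients.shape[0]
    k = max(1, int(round(num_channels * fraction)))
    # Assumption: sensitivity is the mean absolute gradient per channel.
    sensitivity = np.abs(gradients).reshape(num_channels, -1).mean(axis=1)
    return np.argsort(sensitivity)[::-1][:k]


def combine_channels(activations, channel_ids, weights):
    """Weighted sum of the selected activation maps, clipped and scaled to [0, 1].

    activations: array of shape (C, H, W).
    weights: one weight per selected channel (hypothetical input; how the
    weights are obtained is outside the scope of this sketch).
    """
    selected = activations[channel_ids]             # (k, H, W)
    cam = np.tensordot(weights, selected, axes=1)   # (H, W)
    cam = np.maximum(cam, 0.0)                      # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

With 2048 channels and `fraction=0.1`, the sketch keeps about 205 channels and ignores the rest when building the heatmap, which is the filtering step the tldr describes.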

Official Implementation for the paper gScoreCAM: What is CLIP looking at? (2022) by Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. Oral paper at ACCV 2022.

If you use this software, please consider citing:

@inproceedings{chen2022gScoreCAM,
  title={gScoreCAM: What is CLIP looking at?},
  author={Peijie Chen and Qi Li and Saad Biaz and Trung Bui and Anh Nguyen},
  booktitle={Proceedings of the Asian Conference on Computer Vision (ACCV)},
  year={2022}
}

Interactive Colab demo