arielreplicate / gscorecam-clip-analyzer

Shows what CLIP looks at in an image given text

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 24 seconds. The predict time for this model varies significantly based on the inputs.


gScoreCAM: What is CLIP looking at?

Shows which parts of an image are most correlated with a given text, as judged in CLIP's embedding space.
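As a rough illustration of what "correlated in embedding space" means: CLIP scores an image-text pair by the cosine similarity of their embeddings. The sketch below uses made-up 4-dimensional vectors in place of real CLIP embeddings; the vector values and variable names are assumptions for illustration only and are not taken from this repository.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)

# Toy stand-ins for CLIP embeddings (hypothetical values, not model output).
image_embedding = [0.2, 0.9, 0.1, 0.4]
matching_text = [0.1, 0.8, 0.0, 0.5]
unrelated_text = [0.9, -0.1, 0.4, 0.0]

match_score = cosine_similarity(image_embedding, matching_text)
mismatch_score = cosine_similarity(image_embedding, unrelated_text)
```

A text that describes the image should receive the higher score. A saliency method such as gScoreCAM then asks which image regions are responsible for that score.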

tl;dr: We observe that CLIP ResNet-50 channels are very noisy compared to those of a typical ImageNet-trained ResNet-50, and that most saliency methods obtain fairly low object localization scores with CLIP. By visualizing only the top 10% most sensitive (highest-gradient) channels, gScoreCAM obtains state-of-the-art weakly supervised localization results using CLIP (in both ResNet and ViT versions).
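The channel-selection idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the official implementation: the function names, the use of one gradient value per channel, and the toy numbers are all hypothetical. The ScoreCAM-style step that would produce the channel weights (re-scoring the image masked by each selected activation map) is omitted, so the weights are simply passed in.

```python
def top_gradient_channels(channel_gradients, fraction=0.10):
    """Indices of the channels with the largest gradient values.

    Assumption: each channel is summarized by a single gradient value.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep a fixed fraction of channels, rounded down, but at least one.
    k = max(1, int(len(channel_gradients) * fraction))
    ranked = sorted(range(len(channel_gradients)),
                    key=lambda i: channel_gradients[i], reverse=True)
    return sorted(ranked[:k])

def combine_maps(activation_maps, channel_weights):
    """Weighted sum of equally sized 2-D maps, followed by ReLU."""
    rows, cols = len(activation_maps[0]), len(activation_maps[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for amap, weight in zip(activation_maps, channel_weights):
        for r in range(rows):
            for c in range(cols):
                out[r][c] += weight * amap[r][c]
    # ReLU: keep only regions with positive evidence.
    return [[max(0.0, value) for value in row] for row in out]

# Hypothetical per-channel gradients for a 20-channel layer.
example_gradients = [0.1, 0.9, 0.3, 0.2, 0.05, 0.4, 0.8, 0.15, 0.25, 0.35,
                     0.0, 0.12, 0.22, 0.32, 0.42, 0.11, 0.21, 0.31, 0.41, 0.5]
selected = top_gradient_channels(example_gradients)  # keeps 2 of 20 channels
```

Restricting the weighted sum to the selected channels is what keeps the many noisy channels from contributing to the final saliency map.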

Official Implementation for the paper gScoreCAM: What is CLIP looking at? (2022) by Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. Oral paper at ACCV 2022.

If you use this software, please consider citing:

  title={gScoreCAM: What is CLIP looking at?},
  author={Peijie Chen and Qi Li and Saad Biaz and Trung Bui and Anh Nguyen},
  booktitle={Proceedings of the Asian Conference on Computer Vision (ACCV)},

🌟 Interactive Colab demo 🌟