This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 24 seconds. The predict time for this model varies significantly based on the inputs.


gScoreCAM: What is CLIP looking at?

Show which parts of an image are most correlated with a text prompt, as judged in CLIP's embedding space.

tldr: CLIP ResNet-50 channels are very noisy compared to those of a typical ImageNet-trained ResNet-50, and most saliency methods obtain fairly low object localization scores with CLIP. By visualizing only the top 10% most sensitive (highest-gradient) channels, gScoreCAM obtains state-of-the-art weakly supervised localization results using CLIP (in both ResNet and ViT versions).
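To make the channel-selection idea concrete, here is a minimal sketch in plain Python. It is not the official implementation: the function names (`channel_sensitivity`, `top_sensitive_channels`), the use of the mean gradient per channel as the sensitivity measure, and the toy data are all assumptions made for illustration. It covers only the step of picking the top 10% of channels, not the scoring and weighting that produce the final saliency map.

```python
def channel_sensitivity(grad_map):
    """Mean gradient over all spatial positions of one channel.

    Assumption: sensitivity is measured as the mean gradient of the
    image-text score with respect to the channel's activations.
    """
    flat = [g for row in grad_map for g in row]
    return sum(flat) / len(flat)


def top_sensitive_channels(gradients, fraction=0.10):
    """Return the indices of the `fraction` of channels with the highest mean gradient.

    `gradients` is a list of 2D gradient maps (one list of rows per channel).
    At least one channel is always returned.
    """
    scores = [channel_sensitivity(g) for g in gradients]
    k = max(1, round(fraction * len(scores)))
    # Rank channel indices from most to least sensitive.
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])


# Toy example: 20 channels of 2x2 gradient maps; channel c has gradient c / 10 everywhere.
toy_gradients = [[[c / 10.0, c / 10.0], [c / 10.0, c / 10.0]] for c in range(20)]
selected = top_sensitive_channels(toy_gradients)
print(selected)  # [18, 19]
```

In a real pipeline the gradient maps would come from backpropagating the CLIP image-text similarity to a chosen layer, and only the selected channels would then be used to build the saliency map.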

Official implementation of the paper gScoreCAM: What is CLIP looking at? (2022) by Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. Oral paper at ACCV 2022.

