gScoreCAM: What is CLIP looking at?
Shows which parts of an image are most correlated with a given text, as judged in CLIP's embedding space.
tldr: We observe that CLIP ResNet-50 channels are very noisy compared to those of a typical ImageNet-trained ResNet-50, and that most saliency methods obtain fairly low object localization scores with CLIP. By visualizing only the top 10% most sensitive (highest-gradient) channels, gScoreCAM obtains state-of-the-art weakly supervised localization results using CLIP (in both ResNet and ViT versions).
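The channel-selection idea can be illustrated with a minimal NumPy sketch. This is not the repository's code: the function names `top_gradient_channels` and `weighted_saliency` are hypothetical, ranking channels by their spatially averaged gradient is an assumption, and the per-channel weights (which in a ScoreCAM-style method would come from scoring masked images with CLIP) are simply passed in as an array.

```python
import numpy as np


def top_gradient_channels(gradients, fraction=0.1):
    """Return indices of the highest-gradient channels (hypothetical helper).

    gradients: array of shape (C, H, W), the gradient of the image-text
    score with respect to one layer's activations.
    Assumption: channels are ranked by their spatially averaged gradient.
    """
    channel_scores = gradients.mean(axis=(1, 2))
    k = max(1, int(round(fraction * channel_scores.shape[0])))
    # argsort is ascending, so reverse it and keep the first k indices.
    return np.argsort(channel_scores)[::-1][:k]


def weighted_saliency(activations, weights, channel_idx):
    """Combine the selected activation maps into one map scaled to [0, 1].

    activations: (C, H, W) activation maps of the same layer.
    weights: (k,) one weight per selected channel; here they are only
    an input, not computed from a model.
    """
    maps = activations[channel_idx]               # (k, H, W)
    cam = np.tensordot(weights, maps, axes=1)     # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                      # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

With 20 channels and `fraction=0.1`, `top_gradient_channels` keeps 2 channels; the remaining 90% never contribute to the final map, which is the noise-reduction effect described above.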
Official implementation of the paper gScoreCAM: What is CLIP looking at? (2022) by Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. Oral paper at ACCV 2022.
If you use this software, please consider citing:
@inproceedings{chen2022gScoreCAM,
title={gScoreCAM: What is CLIP looking at?},
author={Peijie Chen and Qi Li and Saad Biaz and Trung Bui and Anh Nguyen},
booktitle={Proceedings of the Asian Conference on Computer Vision (ACCV)},
year={2022}
}
:star2: Interactive Colab demo :star2: