j-min / clip-caption-reward

Fine-grained Image Captioning with CLIP Reward

Run time and cost

This model runs on NVIDIA T4 GPU hardware. Predictions typically complete within 7 seconds, though prediction time varies significantly with the inputs.

Readme

Fine-grained Image Captioning with CLIP Reward

Acknowledgments

We thank the developers of CLIP-ViL, ImageCaptioning.pytorch, CLIP, coco-caption, and cider for releasing their code publicly.

Reference

Please cite our paper if you use our models in your work:

```bibtex
@inproceedings{Cho2022CLIPReward,
  title     = {Fine-grained Image Captioning with CLIP Reward},
  author    = {Jaemin Cho and Seunghyun Yoon and Ajinkya Kale and Franck Dernoncourt and Trung Bui and Mohit Bansal},
  booktitle = {Findings of NAACL},
  year      = {2022}
}
```