Image to text
Models that generate text prompts and captions from images
Bootstrapping Language-Image Pre-training
Get an approximate text prompt, with style, matching an image (optimized for Stable Diffusion, CLIP ViT-L/14).
The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Simple image captioning model using CLIP and GPT-2
Fine-grained Image Captioning with CLIP Reward
Image captioning with VirTex
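
As a rough illustration of the idea behind the CLIP Interrogator entry above (start from a caption, then add phrases that move the prompt closer to the image in embedding space), here is a toy sketch. It is not the tool's actual implementation: the vectors are made-up stand-ins for CLIP embeddings, summing vectors stands in for re-encoding the combined prompt, and `build_prompt`, `cosine`, and `add` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add(a, b):
    """Element-wise sum; a crude stand-in for encoding a longer prompt."""
    return [x + y for x, y in zip(a, b)]


def build_prompt(image_vec, caption, caption_vec, candidates, max_extra=2):
    """Greedily append candidate phrases while similarity to the image improves."""
    prompt = [caption]
    prompt_vec = list(caption_vec)
    best = cosine(image_vec, prompt_vec)
    remaining = dict(candidates)
    for _ in range(max_extra):
        if not remaining:
            break
        # Score each remaining phrase by how close the extended prompt gets.
        score, phrase = max(
            (cosine(image_vec, add(prompt_vec, vec)), name)
            for name, vec in remaining.items()
        )
        if score <= best:
            break  # no phrase improves the match, so stop
        prompt.append(phrase)
        prompt_vec = add(prompt_vec, remaining.pop(phrase))
        best = score
    return ", ".join(prompt), best


# Made-up 3-dimensional "embeddings" (assumption: real ones would come from CLIP).
image_vec = [1.0, 1.0, 0.0]
caption = "a castle on a hill"  # assumption: would come from a captioner such as BLIP
caption_vec = [1.0, 0.0, 0.0]
candidates = {
    "oil painting": [0.0, 1.0, 0.0],
    "photograph": [0.0, 0.0, 1.0],
    "watercolor": [0.0, 0.5, 0.5],
}

prompt, score = build_prompt(image_vec, caption, caption_vec, candidates)
print(prompt)
```

With these toy vectors the caption alone covers only part of the image vector, so the loop adds the one style phrase that closes the gap and stops when no further phrase helps.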