Caption images

These models generate text descriptions and captions from images. They use large multimodal transformers trained on image-text pairs to understand visual concepts.

Key capabilities:

  • Image captioning: Generate captions that summarize an image's contents and context, useful for search indexing and for automating alt text for accessibility.
  • Visual question answering: Answer natural-language questions about an image's contents.
  • Text prompt generation: Produce prompts that match an image's style and content, so an existing image can guide text-to-image generation.

Our Pick: LLaVa 13B

For most people, we recommend the LLaVa 13B model. It provides the best balance of intelligence, detail, and cost-effectiveness. LLaVa can generate full paragraphs describing an image in depth, and it gives insightful answers to questions about images.

In particular, LLaVa outperforms models like img2prompt and CLIP Interrogator for generating detailed text-to-image prompts from a source image. Querying LLaVa with a prompt like “Describe the image, including content, medium, creator, styles, and other keywords” produces great prompts around 4 times faster and 100 times cheaper than img2prompt.
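As a concrete illustration, here is a minimal sketch of assembling that prompt-extraction query for LLaVa. The payload field names (`image`, `prompt`, `max_tokens`) and the model identifier in the comment are assumptions for illustration and may not match the current API exactly:

```python
# Sketch: building a LLaVa query for text-to-image prompt extraction.
# Field names and the model identifier are illustrative assumptions.

PROMPT = (
    "Describe the image, including content, medium, creator, "
    "styles, and other keywords"
)

def build_llava_input(image_url: str, max_tokens: int = 256) -> dict:
    """Assemble an input payload for a LLaVa prompt-extraction request."""
    return {
        "image": image_url,        # URL of the source image
        "prompt": PROMPT,          # the prompt-extraction query from above
        "max_tokens": max_tokens,  # cap on output length (assumed name)
    }

# With a hosted-inference Python client, the call might look like this
# (hypothetical model version elided):
#
#   import replicate
#   output = replicate.run(
#       "yorickvp/llava-13b:<version>",
#       input=build_llava_input("https://example.com/photo.jpg"),
#   )
#   print("".join(output))
```

The resulting text can be pasted directly into a text-to-image model as a starting prompt.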

Budget Pick: BLIP

If you need to generate a large volume of image captions or answers and don’t require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced BLIP-2 while running faster, which makes it significantly cheaper per request.

However, BLIP is less capable than LLaVa at generating long-form text or exhibiting deeper visual understanding. Stick with our top pick if you need those advanced capabilities.
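The decision rule in this guide can be codified as a small helper. The function name and its criteria are just this guide's rules of thumb expressed in code, not part of any library:

```python
# Illustrative helper encoding this guide's recommendations:
# LLaVa 13B for detailed, long-form output; BLIP for high-volume,
# cost-sensitive jobs. These are rules of thumb, not an official API.

def pick_caption_model(high_volume: bool, needs_detail: bool) -> str:
    """Return the recommended model name for a captioning workload."""
    if needs_detail:
        # Long-form descriptions and deeper visual understanding
        return "llava-13b"
    if high_volume:
        # Many fast, cheap captions with near-BLIP-2 quality
        return "blip"
    # With no cost pressure, default to the top pick
    return "llava-13b"
```

For example, a bulk alt-text pipeline that only needs short captions would select BLIP, while a prompt-extraction workflow would select LLaVa.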