Collections

Caption Images

These models generate text descriptions and captions from images. They use large multimodal transformers trained on image-text pairs to understand visual concepts.

Key capabilities:

  • Image captioning: Produce relevant captions that summarize an image's contents and context. Useful for indexing images, accessibility, and automating alt text.
  • Visual question answering: Generate natural language answers to questions about your images.
  • Text prompt generation: Create prompts matching image style and content. Use images to guide text-to-image generation.

Our pick: Moondream 2B

Moondream is an efficient, versatile vision language model. It offers a strong balance of intelligence and cost, and it can return a detailed caption in just seconds.

A more powerful model: LLaVa 13B

If you want richer, more detailed output, step up to LLaVa 13B. LLaVa can generate full paragraphs describing an image in depth, and it excels at answering questions about images insightfully.

Budget pick: BLIP

If you need to generate a large volume of image captions or answers and don't require maximum detail or intelligence, BLIP is a great choice. It performs nearly as well as the more advanced but slower BLIP-2, and its speed makes it significantly cheaper per request.

However, BLIP is less capable than Moondream or LLaVa at generating long-form text or exhibiting deeper visual understanding. Stick with Moondream or LLaVa if you need those advanced capabilities.

Frequently asked questions

Which models are the fastest for generating text from images?

If you need quick captions, lucataco/moondream2 is one of the speedier models in the image-to-text collection. It’s optimized to produce short, relevant descriptions without long processing times.
Faster captioning models are best when you need basic descriptions or alt text at scale.

Which models provide the best balance of detail and usability?

yorickvp/llava-13b is a strong option when you need richer, more descriptive outputs. It can handle both simple captioning and more complex visual question answering (VQA), like identifying actions or objects in a scene.
If your goal is accessibility, search indexing, or descriptive tags, lucataco/moondream2 gives you good coverage without long waits.

What works best for accessibility, alt text, or tagging images?

For straightforward image descriptions—like alt text, SEO tags, or catalog metadata—lucataco/moondream2 is a great fit. It generates clear, concise captions that describe what’s in an image.
If you need more context or nuance in those descriptions, switch to a more expressive model like yorickvp/llava-13b.

What should I use if I want to ask questions about an image?

For interactive use cases—like asking “What is this person doing?” or “How many people are here?”—pick a model that supports VQA (visual question answering), such as yorickvp/llava-13b.
These models let you pass both an image and a text question to get a natural language answer.
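As a rough sketch, here's how that might look with Replicate's Python client. The input names (image, prompt) and the streamed output shown here are assumptions based on how yorickvp/llava-13b is typically called; check the model's API tab for its exact schema.

    import replicate

    # Ask a question about an image with a VQA-capable model.
    # Input names ("image", "prompt") are assumptions; check the model's API tab.
    output = replicate.run(
        "yorickvp/llava-13b",
        input={
            "image": open("crowd.jpg", "rb"),
            "prompt": "How many people are in this photo, and what are they doing?",
        },
    )

    # Language-model-style outputs are often streamed as text chunks; join them.
    answer = output if isinstance(output, str) else "".join(output)
    print(answer)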

How do the main types of models in this collection differ?

  • Basic captioning: Fast and simple. Generates a one-line description or tags for an image.
  • Captioning + VQA: Accepts both an image and an optional question, returning more detailed answers.
  • Lightweight vs richer models: Lightweight captioning models are good for scale and speed; larger models like yorickvp/llava-13b provide more detailed reasoning about image content.

What kinds of outputs can I expect from image-to-text models?

Depending on the model, you may get:

  • A short text caption summarizing the image.
  • A set of tags or keywords describing objects or concepts.
  • A full-sentence answer to a question about the image.

Some models give brief outputs, while others can generate paragraph-length text.

How can I self-host or publish my own image-to-text model?

You can package your own image captioning or VQA model with Cog and push it to Replicate. Define your inputs (image and optional question) and outputs (text caption or answer), and set your versioning and sharing settings.
This gives you control over how the model runs and is shared.
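As a minimal sketch, a Cog predictor for a captioning model might look like the following. It assumes you're wrapping the Hugging Face transformers BLIP checkpoint; your own model's loading and inference code will differ.

    # predict.py: a minimal Cog predictor sketch that wraps BLIP for captioning.
    from cog import BasePredictor, Input, Path
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    class Predictor(BasePredictor):
        def setup(self):
            # Load model weights once, when the container starts.
            self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
            self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

        def predict(self, image: Path = Input(description="Input image")) -> str:
            # Generate a short caption for the input image.
            raw_image = Image.open(str(image)).convert("RGB")
            inputs = self.processor(raw_image, return_tensors="pt")
            out = self.model.generate(**inputs, max_new_tokens=40)
            return self.processor.decode(out[0], skip_special_tokens=True)

From there, a cog.yaml file declares your dependencies, and cog push publishes the packaged model to Replicate.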

Can I use these models for commercial work?

Many models in the image-to-text collection support commercial use, but licenses vary. Always check the model card for attribution requirements or restrictions before using outputs in production.

How do I use image-to-text models on Replicate?

  1. Pick a model from the image-to-text collection.
  2. Upload an image or provide a URL.
  3. (Optional) Add a text question if the model supports VQA.
  4. Run the model and get your caption or answer.
  5. Use the text output in your product, accessibility layers, or workflows.
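In code, those steps might look like this with Replicate's Python client. This is a minimal sketch: the input names and the example image URL are assumptions, so check your chosen model's API tab for its exact schema.

    import replicate

    # 1. Pick a model from the collection (you may need to pin a version hash
    #    from the model page, e.g. "lucataco/moondream2:<version>").
    # 2. Provide an image as a URL or a local file handle.
    # 3. Optionally include a question if the model supports VQA.
    output = replicate.run(
        "lucataco/moondream2",
        input={
            "image": "https://example.com/photo.jpg",  # assumed input name; or open("photo.jpg", "rb")
            "prompt": "Describe this image.",          # assumed input name
        },
    )

    # 4. Collect the caption or answer (some models stream text chunks).
    caption = output if isinstance(output, str) else "".join(output)

    # 5. Use the text in your product, accessibility layer, or workflow.
    print(caption)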

What should I keep in mind when working with image-to-text models?

  • Clear, high-quality images lead to better captions.
  • Use fast models for bulk captioning and richer models when context matters.
  • Be specific when asking questions for VQA.
  • If consistency matters (e.g., for SEO or datasets), test and review outputs before scaling.
  • Outputs are text only—if you need audio, you can pair these captions with a text-to-speech model.
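For instance, chaining the last tip above might look like this rough sketch. The "owner/some-tts-model" identifier and its "text" input are placeholders, not a real model; substitute any text-to-speech model from Replicate and its documented input names.

    import replicate

    # Caption the image first (input names are assumptions; check the model's API tab).
    caption = replicate.run(
        "lucataco/moondream2",
        input={"image": open("photo.jpg", "rb"), "prompt": "Describe this image."},
    )
    caption_text = caption if isinstance(caption, str) else "".join(caption)

    # Then speak the caption. "owner/some-tts-model" and its "text" input are
    # placeholders: pick a real text-to-speech model and use its documented inputs.
    audio = replicate.run("owner/some-tts-model", input={"text": caption_text})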