fofr / batch-image-captioning

A wrapper model for captioning multiple images using GPT, Claude or Gemini, useful for LoRA training

Input

image_zip_archive

file (required)

ZIP archive containing images to process
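As a rough sketch of how such an archive could be prepared locally, the following uses only Python's standard library. The function name `build_image_archive` and the choice to store files at the archive root are illustrative assumptions; the extension list is taken from the README's "supports png, jpg, jpeg, webp".

```python
import zipfile
from pathlib import Path

# Extensions the README lists as supported.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def build_image_archive(image_dir, archive_path):
    """Zip every supported image directly inside image_dir; return the names added.

    Hypothetical helper: the model only requires a ZIP of images, it does not
    prescribe how the ZIP is built.
    """
    image_dir = Path(image_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(image_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                # Store each image at the archive root under its own filename.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

The resulting file can then be uploaded or hosted and passed as image_zip_archive.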

caption_prefix

string

Optional prefix for image captions

Default: ""

caption_suffix

string

Optional suffix for image captions

Default: ""

resize_images_for_captioning

boolean

Whether to resize images for captioning. This makes captioning cheaper

Default: true

max_dimension

integer

Maximum dimension (width or height) for resized images

Default: 1024
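To illustrate what max_dimension means in practice, here is a minimal sketch of aspect-ratio-preserving resizing. It is not the model's actual code: the function name `resized_dimensions`, the use of rounding, and the decision to leave smaller images untouched are assumptions. It does reproduce the "Resized from 928x1232 to 771x1024" line from the example logs.

```python
def resized_dimensions(width, height, max_dimension=1024):
    """Return (width, height) scaled so the longer side equals max_dimension.

    Assumption: images already within the limit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    # Round to the nearest pixel; the model may truncate instead.
    return round(width * scale), round(height * scale)


print(resized_dimensions(928, 1232))  # (771, 1024)
```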

model

string

AI model to use for captioning. Your OpenAI, Anthropic or Google account will be charged for usage; see their pricing pages for details.

Default: "gpt-4o-2024-08-06"

openai_api_key

secret

API key for OpenAI

anthropic_api_key

secret

API key for Anthropic

google_generativeai_api_key

secret

API key for Google Generative AI
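Only the key for the provider behind the chosen model should be needed. The sketch below shows one way a caller might pick the right input field; the prefix-to-field mapping and the helper names `key_field_for` and `build_input` are assumptions for illustration, not part of the model's API.

```python
# Assumed mapping from model-name prefix to the input field that holds the key.
KEY_FIELD_BY_PREFIX = {
    "gpt": "openai_api_key",
    "claude": "anthropic_api_key",
    "gemini": "google_generativeai_api_key",
}


def key_field_for(model):
    """Return the name of the API-key input that matches the model name."""
    for prefix, field in KEY_FIELD_BY_PREFIX.items():
        if model.lower().startswith(prefix):
            return field
    raise ValueError(f"Unrecognized model name: {model!r}")


def build_input(model, api_key, image_zip_archive):
    """Assemble a minimal input dict, leaving the other key fields unset."""
    return {
        "model": model,
        "image_zip_archive": image_zip_archive,
        key_field_for(model): api_key,
    }
```

The returned dict can be passed as the input argument of replicate.run, with further fields such as caption_prefix added as needed.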

system_prompt

string

Write a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using "or". Precise composition is important. Avoid phrases like "conveys a sense of" and "capturing the", just use the terms themselves.

Good examples are:

"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine."

"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series."

"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed."

System prompt for image analysis

Default: "\nWrite a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using \"or\". Precise composition is important. Avoid phrases like \"conveys a sense of\" and \"capturing the\", just use the terms themselves.\n\nGood examples are:\n\n\"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine.\"\n\n\"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series.\"\n\n\"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed.\"\n"

message_prompt

string

Message prompt for image captioning

Default: "Caption this image please"

Run this model in Node.js with one line of code:

npx create-replicate --model=fofr/batch-image-captioning

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";
import fs from "node:fs";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run fofr/batch-image-captioning using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "fofr/batch-image-captioning:d0adb15f4826881a68f1d82e0b10fe2ee1af536632dc8313f7f777ed8d264726",
  {
    input: {
      model: "gpt-4o-2024-08-06",
      max_dimension: 1024,
      system_prompt: "Write a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using \"or\". Precise composition is important. Avoid phrases like \"conveys a sense of\" and \"capturing the\", just use the terms themselves.\n\nGood examples are:\n\n\"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine.\"\n\n\"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series.\"\n\n\"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed.\"\n",
      caption_prefix: "",
      caption_suffix: "",
      message_prompt: "Caption this image please",
      openai_api_key: "",
      image_zip_archive: "https://replicate.delivery/pbxt/LREOQCiXFRxVaSpwt2MYMwuwiEMIuiIw8YPm7rLLGPH94f57/Archive.zip",
      resize_images_for_captioning: true
    }
  }
);

// To access the file URL:
console.log(output.url()); //=> "http://example.com"

// To write the ZIP archive to disk (fs.writeFile needs a callback, so use the promise API):
await fs.promises.writeFile("captions_and_csv.zip", output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run fofr/batch-image-captioning using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "fofr/batch-image-captioning:d0adb15f4826881a68f1d82e0b10fe2ee1af536632dc8313f7f777ed8d264726",
    input={
        "model": "gpt-4o-2024-08-06",
        "max_dimension": 1024,
        "system_prompt": "Write a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using \"or\". Precise composition is important. Avoid phrases like \"conveys a sense of\" and \"capturing the\", just use the terms themselves.\n\nGood examples are:\n\n\"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine.\"\n\n\"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series.\"\n\n\"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed.\"\n",
        "caption_prefix": "",
        "caption_suffix": "",
        "message_prompt": "Caption this image please",
        "openai_api_key": "",
        "image_zip_archive": "https://replicate.delivery/pbxt/LREOQCiXFRxVaSpwt2MYMwuwiEMIuiIw8YPm7rLLGPH94f57/Archive.zip",
        "resize_images_for_captioning": True
    }
)

# To access the file URL:
print(output.url())
#=> "http://example.com"

# To write the ZIP archive to disk:
with open("captions_and_csv.zip", "wb") as file:
    file.write(output.read())

To learn more, take a look at the guide on getting started with Python.

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run fofr/batch-image-captioning using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "fofr/batch-image-captioning:d0adb15f4826881a68f1d82e0b10fe2ee1af536632dc8313f7f777ed8d264726",
    "input": {
      "model": "gpt-4o-2024-08-06",
      "max_dimension": 1024,
      "system_prompt": "Write a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using \\"or\\". Precise composition is important. Avoid phrases like \\"conveys a sense of\\" and \\"capturing the\\", just use the terms themselves.\\n\\nGood examples are:\\n\\n\\"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine.\\"\\n\\n\\"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series.\\"\\n\\n\\"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed.\\"\\n",
      "caption_prefix": "",
      "caption_suffix": "",
      "message_prompt": "Caption this image please",
      "openai_api_key": "",
      "image_zip_archive": "https://replicate.delivery/pbxt/LREOQCiXFRxVaSpwt2MYMwuwiEMIuiIw8YPm7rLLGPH94f57/Archive.zip",
      "resize_images_for_captioning": true
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.

Output

captions_and_csv.zip

{
  "completed_at": "2024-08-13T11:32:47.134586Z",
  "created_at": "2024-08-13T11:31:46.854000Z",
  "data_removed": false,
  "error": null,
  "id": "8zj4ygh84srg80ch9e19xw111m",
  "input": {
    "model": "gpt-4o-2024-08-06",
    "max_dimension": 1024,
    "system_prompt": "Write a four sentence caption for this image. In the first sentence describe the style and type (painting, photo, etc) of the image. Describe in the remaining sentences the contents and composition of the image. Only use language that would be used to prompt a text to image model. Do not include usage. Comma separate keywords rather than using \"or\". Precise composition is important. Avoid phrases like \"conveys a sense of\" and \"capturing the\", just use the terms themselves.\n\nGood examples are:\n\n\"Photo of an alien woman with a glowing halo standing on top of a mountain, wearing a white robe and silver mask in the futuristic style with futuristic design, sky background, soft lighting, dynamic pose, a sense of future technology, a science fiction movie scene rendered in the Unreal Engine.\"\n\n\"A scene from the cartoon series Masters of the Universe depicts Man-At-Arms wearing a gray helmet and gray armor with red gloves. He is holding an iron bar above his head while looking down on Orko, a pink blob character. Orko is sitting behind Man-At-Arms facing left on a chair. Both characters are standing near each other, with Orko inside a yellow chestplate over a blue shirt and black pants. The scene is drawn in the style of the Masters of the Universe cartoon series.\"\n\n\"An emoji, digital illustration, playful, whimsical. A cartoon zombie character with green skin and tattered clothes reaches forward with two hands, they have green skin, messy hair, an open mouth and gaping teeth, one eye is half closed.\"\n",
    "caption_prefix": "",
    "caption_suffix": "",
    "message_prompt": "Caption this image please",
    "openai_api_key": "[REDACTED]",
    "image_zip_archive": "https://replicate.delivery/pbxt/LREOQCiXFRxVaSpwt2MYMwuwiEMIuiIw8YPm7rLLGPH94f57/Archive.zip",
    "resize_images_for_captioning": true
  },
  "logs": "Files extracted:\n/tmp/outputs/2024-06-01--15-16-53-u-q3-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\n/tmp/outputs/2024-06-01--15-16-53-u-q1-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\n/tmp/outputs/2024-06-01--15-16-53-u-q4-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\n/tmp/outputs/2024-06-01--15-16-53-u-q2-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\nNumber of images to be captioned: 4\n===================================================\nProcessing 2024-06-01--15-16-53-u-q3-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\nResized from 928x1232 to 771x1024\nCaption: Digital artwork of an abstract, cybernetic rabbit. The rabbit is composed of intricate, neon-like lines and patterns in blue and white, with a luminous, swirling design. Its eyes are glowing red, and it sits against a dark background. A spherical, luminous object hovers in the top right corner, casting a mystical glow.\n===================================================\nProcessing 2024-06-01--15-16-53-u-q1-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\nResized from 928x1232 to 771x1024\nCaption: Abstract digital illustration of a rabbit, featuring dynamic, swirling lines and vivid colors. The focus is on the rabbit's glowing red eyes and long ears, drawn in a sketchy style. The background is a dark, cosmic mix of deep blues and bright reds, creating a sense of mystery. Thin, energetic white lines surround the figure, adding motion and intensity.\n===================================================\nProcessing 2024-06-01--15-16-53-u-q4-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\nResized from 928x1232 to 771x1024\nCaption: Abstract digital painting featuring an intense, mischievous creature with large ears and glowing yellow eyes. The figure is surrounded by a chaotic swirl of red and blue strokes, giving a sense of movement and energy. The creature's wide grin and sharp features contribute to its menacing presence. Dark background contrasts with vibrant colors, enhancing the dramatic effect.\n===================================================\nProcessing 2024-06-01--15-16-53-u-q2-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png\nResized from 928x1232 to 771x1024\nCaption: Digital painting, abstract, vibrant. A stylized creature resembling a rabbit is depicted in dark blue and black tones. Bold, swirling yellow and white strokes create dynamic movement around the creature. The background is filled with intricate patterns, deep shadows, and highlights.\n===================================================",
  "metrics": {
    "predict_time": 23.038720888,
    "total_time": 60.280586
  },
  "output": "https://replicate.delivery/czjl/U2CPf1tLXev5gkSFqfxUUeD1cqyyi2VWS0fGB6iBmgL7L7RaC/captions_and_csv.zip",
  "started_at": "2024-08-13T11:32:24.095865Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/8zj4ygh84srg80ch9e19xw111m",
    "cancel": "https://api.replicate.com/v1/predictions/8zj4ygh84srg80ch9e19xw111m/cancel"
  },
  "version": "3adde40e56d70b1ff1a6f1300da81b8af9a0f7983163f83022ebdb2c911fdc49"
}

Generated in 23.0 seconds
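The 23.0 second figure corresponds to metrics.predict_time in the prediction JSON, while total_time (about 60.3 seconds) also covers the gap between created_at and started_at. The sketch below recomputes both from the timestamps; the helper names are hypothetical, and reading the gap as queueing or boot time is an assumption rather than something the page states.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def timing_breakdown(prediction):
    """Split a prediction's wall-clock time into waiting and predict phases."""
    created = parse_timestamp(prediction["created_at"])
    started = parse_timestamp(prediction["started_at"])
    completed = parse_timestamp(prediction["completed_at"])
    return {
        # Time before the run started (assumed: queueing and/or boot).
        "waiting": (started - created).total_seconds(),
        "predict": (completed - started).total_seconds(),
        "total": (completed - created).total_seconds(),
    }


example = {
    "created_at": "2024-08-13T11:31:46.854000Z",
    "started_at": "2024-08-13T11:32:24.095865Z",
    "completed_at": "2024-08-13T11:32:47.134586Z",
}
print(timing_breakdown(example))
```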

Files extracted:
/tmp/outputs/2024-06-01--15-16-53-u-q3-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
/tmp/outputs/2024-06-01--15-16-53-u-q1-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
/tmp/outputs/2024-06-01--15-16-53-u-q4-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
/tmp/outputs/2024-06-01--15-16-53-u-q2-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
Number of images to be captioned: 4
===================================================
Processing 2024-06-01--15-16-53-u-q3-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
Resized from 928x1232 to 771x1024
Caption: Digital artwork of an abstract, cybernetic rabbit. The rabbit is composed of intricate, neon-like lines and patterns in blue and white, with a luminous, swirling design. Its eyes are glowing red, and it sits against a dark background. A spherical, luminous object hovers in the top right corner, casting a mystical glow.
===================================================
Processing 2024-06-01--15-16-53-u-q1-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
Resized from 928x1232 to 771x1024
Caption: Abstract digital illustration of a rabbit, featuring dynamic, swirling lines and vivid colors. The focus is on the rabbit's glowing red eyes and long ears, drawn in a sketchy style. The background is a dark, cosmic mix of deep blues and bright reds, creating a sense of mystery. Thin, energetic white lines surround the figure, adding motion and intensity.
===================================================
Processing 2024-06-01--15-16-53-u-q4-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
Resized from 928x1232 to 771x1024
Caption: Abstract digital painting featuring an intense, mischievous creature with large ears and glowing yellow eyes. The figure is surrounded by a chaotic swirl of red and blue strokes, giving a sense of movement and energy. The creature's wide grin and sharp features contribute to its menacing presence. Dark background contrasts with vibrant colors, enhancing the dramatic effect.
===================================================
Processing 2024-06-01--15-16-53-u-q2-fofr_pikachu_91215d95-1cb7-43c3-9448-5d97975efcf1.png
Resized from 928x1232 to 771x1024
Caption: Digital painting, abstract, vibrant. A stylized creature resembling a rabbit is depicted in dark blue and black tones. Bold, swirling yellow and white strokes create dynamic movement around the creature. The background is filled with intricate patterns, deep shadows, and highlights.
===================================================
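The log format above is regular enough that captions can be recovered from the logs alone, for instance when only the prediction JSON is at hand. This is a minimal sketch that assumes each "Processing <filename>" line is followed by a single-line "Caption: ..." entry, as in this run; `parse_captions` is a hypothetical helper and would drop any continuation lines of a multi-line caption.

```python
def parse_captions(logs):
    """Map each processed filename to its caption, based on the log layout."""
    captions = {}
    current = None
    for line in logs.splitlines():
        if line.startswith("Processing "):
            current = line[len("Processing "):].strip()
        elif line.startswith("Caption: ") and current is not None:
            captions[current] = line[len("Caption: "):].strip()
            current = None
    return captions


sample = (
    "Number of images to be captioned: 1\n"
    "===================================================\n"
    "Processing example.png\n"
    "Resized from 928x1232 to 771x1024\n"
    "Caption: Digital painting, abstract, vibrant.\n"
    "===================================================\n"
)
print(parse_captions(sample))
```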

This output was created using a different version of the model, fofr/batch-image-captioning:3adde40e.

Run time and cost

This model runs on CPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Batch image captioning

A Cog model for batch image captioning using AI models from OpenAI, Anthropic, and Google.

Features

Process multiple images from a ZIP archive (supports png, jpg, jpeg, webp)
Optional image resizing for more cost-effective captioning
Customizable caption prefixes and suffixes
Support for multiple AI models:
  OpenAI: GPT-4 and variants
  Anthropic: Claude-3.5, Claude-3 variants
  Google: Gemini-1.5 variants
Flexible system and message prompts
Error handling and retry mechanism
Output as a ZIP file containing captions that match image filenames, plus a CSV summary
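Once the output ZIP has been downloaded, the captions can be paired back with their images by filename. The sketch below assumes each caption is stored as a UTF-8 .txt file sharing the image's base name; the README only says the captions "match image filenames", so the extension is a guess, and `read_captions` is a hypothetical helper.

```python
import io
import zipfile
from pathlib import PurePosixPath


def read_captions(zip_bytes):
    """Return {image base name: caption} from the bytes of the output ZIP.

    Assumption: captions are .txt entries; other entries (such as the CSV
    summary) are ignored.
    """
    captions = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                stem = PurePosixPath(name).stem
                captions[stem] = archive.read(name).decode("utf-8").strip()
    return captions
```

With the Python client, the bytes could come from output.read() as in the example above.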