adirik/kosmos-g – Run with an API on Replicate

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Public
4.3K runs
L40S

Run with an API

Run this model in Node.js with one line of code:

npx create-replicate --model=adirik/kosmos-g

Alternatively, set up a project from scratch.

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run adirik/kosmos-g using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "adirik/kosmos-g:56f9fde586eeecfd03c9c34da1c40f5e513af2d511d4b1961f810df1334cc6e9",
  {
    input: {
      seed: 20,
      image1: "https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg",
      image2: "https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg",
      prompt: "<i> in the suit of <i>",
      num_images: 1,
      negative_prompt: "",
      num_inference_steps: 50,
      text_guidance_scale: 6
    }
  }
);
console.log(output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run adirik/kosmos-g using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "adirik/kosmos-g:56f9fde586eeecfd03c9c34da1c40f5e513af2d511d4b1961f810df1334cc6e9",
    input={
        "seed": 20,
        "image1": "https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg",
        "image2": "https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg",
        "prompt": "<i> in the suit of <i>",
        "num_images": 1,
        "negative_prompt": "",
        "num_inference_steps": 50,
        "text_guidance_scale": 6
    }
)
print(output)

To learn more, take a look at the guide on getting started with Python.

To call the HTTP API directly with cURL, set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run adirik/kosmos-g using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "56f9fde586eeecfd03c9c34da1c40f5e513af2d511d4b1961f810df1334cc6e9",
    "input": {
      "seed": 20,
      "image1": "https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg",
      "image2": "https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg",
      "prompt": "<i> in the suit of <i>",
      "num_images": 1,
      "negative_prompt": "",
      "num_inference_steps": 50,
      "text_guidance_scale": 6
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.
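
The Prefer: wait header asks the API to hold the response until the prediction finishes, but a slow prediction may still be returned before it completes. The sketch below shows one way to poll in that case. It assumes the prediction object carries a status field that ends in succeeded, failed, or canceled, and an output field on success; the function wait_for_prediction and the injected fetch callable are hypothetical names, chosen so the loop can be exercised without network access.

```python
import time

# Assumed terminal values of the prediction's "status" field.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(prediction, fetch, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the prediction reaches a terminal status (hypothetical helper).

    prediction: dict returned by the create request (must contain "status").
    fetch: callable that takes the current prediction dict and returns a fresh
        one, for example a GET on prediction["urls"]["get"] sent with the same
        Authorization header. It is injected so this sketch has no network code.
    """
    waited = 0.0
    while prediction["status"] not in TERMINAL_STATUSES:
        if waited >= timeout:
            raise TimeoutError(f"prediction still {prediction['status']} after {timeout}s")
        sleep(interval)
        waited += interval
        prediction = fetch(prediction)
    if prediction["status"] != "succeeded":
        raise RuntimeError(f"prediction {prediction['status']}: {prediction.get('error')}")
    return prediction["output"]


# Offline demonstration: canned responses stand in for the API.
_responses = iter([
    {"status": "processing"},
    {"status": "succeeded", "output": ["https://example.com/out.png"]},
])
result = wait_for_prediction(
    {"status": "starting"},
    fetch=lambda _prediction: next(_responses),
    sleep=lambda _seconds: None,  # skip real sleeping in the demo
)
print(result)
```

In real use, fetch would issue the GET request and parse the JSON body; the timeout default here is an arbitrary choice, not a documented limit.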

You can run this model locally using Cog. First, install Cog:

brew install cog

If you don’t have Homebrew, there are other installation options available.

Run this to download the model and run it in your local environment:

cog predict r8.im/adirik/kosmos-g@sha256:56f9fde586eeecfd03c9c34da1c40f5e513af2d511d4b1961f810df1334cc6e9 \
  -i 'seed=20' \
  -i 'image1="https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg"' \
  -i 'image2="https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg"' \
  -i 'prompt="<i> in the suit of <i>"' \
  -i 'num_images=1' \
  -i 'negative_prompt=""' \
  -i 'num_inference_steps=50' \
  -i 'text_guidance_scale=6'

To learn more, take a look at the Cog documentation.

To run the model with Docker instead, start the container and then send a prediction request to its local HTTP server:

docker run -d -p 5000:5000 --gpus=all r8.im/adirik/kosmos-g@sha256:56f9fde586eeecfd03c9c34da1c40f5e513af2d511d4b1961f810df1334cc6e9
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d $'{
    "input": {
      "seed": 20,
      "image1": "https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg",
      "image2": "https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg",
      "prompt": "<i> in the suit of <i>",
      "num_images": 1,
      "negative_prompt": "",
      "num_inference_steps": 50,
      "text_guidance_scale": 6
    }
  }' \
  http://localhost:5000/predictions

To learn more, take a look at the Cog documentation.
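
The local server replies with JSON, and the generated images have to be pulled out of it. The sketch below assumes that each entry in the response's output field is a base64 data URI, which should be checked against the actual response; decode_data_uri is a hypothetical helper name.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, encoded = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)


# Stand-in for one entry of the "output" list in the server's JSON response
# (assumed format; the bytes below are fake, not a real image).
sample = "data:image/png;base64," + base64.b64encode(b"\x89PNG fake bytes").decode("ascii")
mime_type, image_bytes = decode_data_uri(sample)
print(mime_type, len(image_bytes))
```

The returned bytes can then be written to a file opened in binary mode. If the server returns plain URLs instead, this step is unnecessary.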

Output

Generated in 4.6 seconds

Run time and cost

Each run of this model costs approximately $0.17 on Replicate (about 5 runs per $1), though this varies depending on your inputs. The model is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.
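
For rough budgeting, the per-run figure above can be turned into a quick estimate. This is a back-of-the-envelope sketch that assumes a flat $0.17 per run, which the page itself notes will vary with inputs; the function names are illustrative.

```python
import math

COST_PER_RUN_USD = 0.17  # approximate figure quoted on this page; varies with inputs


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Whole runs that fit in $1 at a flat per-run cost."""
    return math.floor(1 / cost_per_run)


def estimated_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Estimated total in USD, rounded to cents."""
    return round(num_runs * cost_per_run, 2)


print(runs_per_dollar())    # 5
print(estimated_cost(100))  # 17.0
```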

Readme

Model Description

Kosmos-G by Microsoft is a multi-modal model that generates images from multi-modal prompts. Out of the box, it can produce image variations and mix images, in addition to text-driven generation and personalization of images.

Abstract: Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of “image as a foreign language in image generation.”

See the paper, official repository and project page for more information.

Usage

Kosmos-G expects multi-modal input consisting of one or more images and a text prompt to guide the generation. Images are denoted with <i> within the text prompt, e.g. “<i> standing next to <i>”. You can additionally supply a negative_prompt to guide the diffusion process.
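
As a minimal sketch of this prompt format, the helper below builds an input dictionary from a prompt and a list of image URLs. The name build_input is hypothetical, and the check that the number of <i> placeholders equals the number of images is an assumption about how placeholders pair with images, not documented behavior. The parameter names image1, image2, prompt, and negative_prompt come from the example requests on this page.

```python
IMAGE_TOKEN = "<i>"  # placeholder used in the example prompt on this page


def build_input(prompt, image_urls, negative_prompt=""):
    """Build an input dict for adirik/kosmos-g (hypothetical helper).

    Assumption: each "<i>" in the prompt corresponds, in order, to one image
    (image1, image2, ...). The model's schema is the authority on how many
    image inputs it actually accepts.
    """
    n_tokens = prompt.count(IMAGE_TOKEN)
    if n_tokens != len(image_urls):
        raise ValueError(
            f"prompt has {n_tokens} {IMAGE_TOKEN} placeholder(s) "
            f"but {len(image_urls)} image(s) were given"
        )
    payload = {"prompt": prompt, "negative_prompt": negative_prompt}
    for index, url in enumerate(image_urls, start=1):
        payload[f"image{index}"] = url
    return payload


example = build_input(
    "<i> in the suit of <i>",
    [
        "https://replicate.delivery/pbxt/K0Tzk2bsc76SSsRgrJSoh0fnzrB5M0Dqeqe7YHLxf1x7fU4S/FELV-cat.jpg",
        "https://replicate.delivery/pbxt/K0TzkJl3lco0aPVIg8iQYUB0ursK7ZWO0ECEVsxMMQlf5eKH/ironman.jpg",
    ],
)
print(sorted(example))
```

The resulting dictionary can be merged with seed, num_images, and the other parameters shown in the API examples and passed as the input argument to replicate.run.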

Citation

@article{kosmos-g,
  title={{Kosmos-G}: Generating Images in Context with Multimodal Large Language Models},
  author={Xichen Pan and Li Dong and Shaohan Huang and Zhiliang Peng and Wenhu Chen and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.02992}
}