adirik / kosmos-g

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

  • Public
  • 4.1K runs
  • GitHub
  • Paper
  • License

Run time and cost

This model costs approximately $0.17 to run on Replicate, or 5 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Model Description

Kosmos-G by Microsoft is a multi-modal model that can generate images from multi-modal prompts. Kosmos-G can generate image variations and perform image mixing out of the box, on top of text driven generation and personalization of images.

Abstract: Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, the generation from generalized vision-language inputs, especially involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of “image as a foreign language in image generation.”

See the paper, official repository and project page for more information.

Usage

Kosmos-G expects multi-modal input in the format of one or more images and a text prompt to guide the generation. Images are denoted with within the text prompt - e.g. “ standing next to “. You can additionally input a negative_prompt to guide the diffusion process.

Citation

@article{kosmos-g,
  title={{Kosmos-G}: Generating Images in Context with Multimodal Large Language Models},
  author={Xichen Pan and Li Dong and Shaohan Huang and Zhiliang Peng and Wenhu Chen and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.02992}
}