chenxwh / omost

Convert LLM's coding to image generation

Run time and cost

This model costs approximately $0.13 to run on Replicate, or 7 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Omost

Omost is a project that converts an LLM’s coding capability into image generation (or, more accurately, image composition) capability.

The name Omost (pronunciation: almost) has two meanings: 1) every time you use Omost, your image is almost there; 2) the O means “omni” (multi-modal) and most means we want to get the most out of it.

Omost provides LLM models that write code to compose the visual content of an image using Omost’s virtual Canvas agent. This Canvas can then be rendered by specific implementations of image generators to actually generate images.
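
To make the idea concrete, here is a minimal sketch of what "code that composes an image" could look like. The `ToyCanvas` and `Region` classes, their method names, and the fractional box coordinates are all hypothetical stand-ins invented for illustration; they are not the Omost Canvas API, and a real renderer would consume a richer structure.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (left, top, right, bottom), each in 0.0 to 1.0


@dataclass
class Region:
    """One described area of the image."""
    box: Box
    description: str


@dataclass
class ToyCanvas:
    """Hypothetical stand-in for a virtual canvas; NOT the Omost Canvas API."""
    global_description: str = ""
    regions: list[Region] = field(default_factory=list)

    def describe_scene(self, text: str) -> None:
        """Set the description that applies to the whole image."""
        self.global_description = text

    def describe_region(self, box: Box, text: str) -> None:
        """Attach a description to a rectangular part of the image."""
        left, top, right, bottom = box
        if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
            raise ValueError(f"invalid box: {box}")
        self.regions.append(Region(box, text))

    def to_conditions(self) -> list[dict]:
        """Flatten the canvas into a list that a renderer could consume."""
        conditions = [{"box": (0.0, 0.0, 1.0, 1.0), "prompt": self.global_description}]
        conditions += [{"box": r.box, "prompt": r.description} for r in self.regions]
        return conditions


# The kind of short program an LLM might emit for "a cat sleeping on a sofa".
canvas = ToyCanvas()
canvas.describe_scene("a cozy living room")
canvas.describe_region((0.1, 0.5, 0.9, 1.0), "a green sofa")
canvas.describe_region((0.4, 0.4, 0.6, 0.6), "a sleeping cat")
conditions = canvas.to_conditions()
```

The point of the split is that the LLM only has to produce a valid program; turning the resulting `conditions` into pixels is left to whichever image generator renders the canvas.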

Currently, we provide three pretrained LLM models based on variations of Llama3 and Phi3 (see also the model notes below).

All models are trained with mixed data of (1) ground-truth annotations of several datasets including Open-Images, (2) data extracted by automatically annotating images, (3) reinforcement from DPO (Direct Preference Optimization, with “whether the code can be compiled by Python 3.10 or not” as a direct preference), and (4) a small amount of tuning data from OpenAI GPT4o’s multi-modal capability.
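
As a rough illustration of the preference signal in (3), the sketch below checks whether a piece of generated code compiles and uses that to order two candidates. The function names `compiles` and `preference_pair` are hypothetical, and this is not the team's actual training pipeline; note also that the built-in `compile` uses the grammar of whichever interpreter runs it, so matching the "Python 3.10" criterion would mean running it under 3.10.

```python
def compiles(source: str) -> bool:
    """Return True if `source` compiles as Python, without executing it."""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers IndentationError; ValueError covers e.g. null bytes
        # on some interpreter versions.
        return False


def preference_pair(candidate_a: str, candidate_b: str):
    """Return (chosen, rejected) when exactly one candidate compiles, else None."""
    ok_a, ok_b = compiles(candidate_a), compiles(candidate_b)
    if ok_a == ok_b:
        return None  # both or neither compile: no compile-based preference
    return (candidate_a, candidate_b) if ok_a else (candidate_b, candidate_a)


good = "canvas = 1\n"
bad = "canvas = (\n"  # unclosed parenthesis
pair = preference_pair(good, bad)
```

Pairs where both candidates compile (or both fail) carry no signal under this criterion, which is why the sketch returns `None` for them instead of picking arbitrarily.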

Some notes:

  1. The recommended quantization is 4 bits for omost-llama-3-8b and 8 bits for omost-phi-3-mini-128k (3.8B). They all fit in 8GB of VRAM without offloading. At these settings the performance degradation caused by quantization is minimal, and I have personally never observed any evidence of it. However, quantizing omost-phi-3-mini-128k to 4 bits is not recommended, since I noticed obvious performance degradation. Treat 4-bit inference with omost-phi-3-mini-128k as a last resort for cases where you really do not have a more capable GPU.
  2. My user study shows that omost-llama-3-8b-4bits > omost-dolphin-2.9-llama3-8b-4bits > omost-phi-3-mini-128k-8bits. So in most cases one should just use omost-llama-3-8b-4bits.
  3. The omost-llama-3-8b and omost-phi-3-mini-128k models are trained on filtered, safe data without NSFW or inappropriate content. See (4) if you need a different option.
  4. The omost-dolphin-2.9-llama3-8b model is trained on all data WITHOUT any filtering. You must apply your own safety alignment methods if you expose any service based on omost-dolphin-2.9-llama3-8b to the public.
  5. Note that the filtering in (3) is not due to any policy. The reason is that I noticed slight instability in the training gradients of those models, since their pretrained instruction following is regulated by safety alignment, and this caused performance to degrade a bit. The instruction following of omost-dolphin-2.9-llama3-8b was pretrained through community efforts and does not have this problem.
  6. The 128k context length of omost-phi-3-mini-128k cannot be trusted. Its performance degrades considerably once the token count reaches about 8k. One should just view it as a model with about 8k context length.
  7. A model with 8k context length can do about 5 to 6 rounds of conversational editing. If you are about to run out of context, use the UI to modify your message and have the model respond again (this can be repeated any number of times).
  8. All models are fully trained on our H100 clusters at fp16 precision, without tricks such as quantization or Q-LoRA. The optimizer is Adam without any tricks.
  9. You must also follow the licenses of Llama-3 and Phi-3.
  10. You can ask us to train on other LLMs if the request is reasonable and necessary.
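
The arithmetic behind notes 1 and 7 can be sketched as follows. This is a back-of-the-envelope estimate under stated assumptions: parameter counts are the nominal figures from the model names, "8k" is taken as 8192 tokens, and only the weights are counted (activations, KV cache, and quantization overhead are ignored, so real usage will be higher). The helper names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def tokens_per_round(context_tokens: int, rounds: int) -> int:
    """Average token budget per editing round if the context is split evenly."""
    return context_tokens // rounds


# Nominal parameter counts taken from the model names (assumption).
llama_4bit = weight_gib(8.0, 4)  # omost-llama-3-8b at 4 bits
phi_8bit = weight_gib(3.8, 8)    # omost-phi-3-mini-128k at 8 bits

print(f"llama-3-8b @ 4 bits: {llama_4bit:.2f} GiB")
print(f"phi-3-mini @ 8 bits: {phi_8bit:.2f} GiB")
print(f"tokens per round over 5 to 6 rounds of 8192 tokens: "
      f"{tokens_per_round(8192, 6)} to {tokens_per_round(8192, 5)}")
```

Both weight estimates land well under 8GB, which is consistent with the note that the recommended quantizations fit without offloading, and the per-round figure gives a rough sense of how long each conversational edit can be before the context fills up.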

Cite

@Misc{omost,
  author = {Omost Team},
  title  = {Omost GitHub Page},
  year   = {2024},
}

Related Work

Also read …

DOCCI: Descriptions of Connected and Contrasting Images

(RPG-DiffusionMaster) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models and Self-correcting LLM-controlled Diffusion Models

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

sd-webui-regional-prompter