# Omost
Omost is a project to convert an LLM's coding capability into image generation (or, more accurately, image composing) capability.

The name *Omost* (pronounced: almost) has two meanings: 1) every time you use Omost, your image is *almost* there; 2) the *O* means "omni" (multi-modal) and *most* means we want to get the *most* out of it.
Omost provides LLMs that write code to compose image visual content with Omost's virtual `Canvas` agent. This `Canvas` can then be rendered by specific implementations of image generators to actually generate images. A sketch of what such code looks like is shown below.
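To give a concrete feel for the idea, here is a minimal sketch of the kind of `Canvas` code the LLM emits. The parameter names below are illustrative assumptions, not the definitive interface; see the actual model outputs for the exact signatures.

```python
# Sketch of LLM-generated Canvas code. `Canvas` is predefined by Omost's
# runtime environment, so the generated code contains no imports.
canvas = Canvas()

# A global description sets the overall content of the image.
canvas.set_global_description(
    description='a cat sitting on a windowsill at sunset',
    detailed_descriptions=[
        'a fluffy orange cat looking out of the window',
        'warm golden light falling across the room',
    ],
    tags='cat, windowsill, sunset, cozy, warm light',
)

# Local descriptions bind sub-prompts to regions of the canvas, which an
# image generator can later render into actual pixels.
canvas.add_local_description(
    location='on the right',
    area='a medium-sized vertical area',
    description='a fluffy orange cat',
    detailed_descriptions=['green eyes', 'tail curled around its paws'],
    tags='cat, orange, fluffy',
)
```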
Currently, we provide three pretrained LLMs based on variations of Llama 3 and Phi-3 (see also the model notes at the end of this page).
All models are trained with a mix of (1) ground-truth annotations from several datasets, including Open-Images, (2) data extracted by automatically annotating images, (3) reinforcement via DPO (Direct Preference Optimization, with "whether the code can be compiled by Python 3.10 or not" as a direct preference), and (4) a small amount of tuning data from OpenAI GPT-4o's multi-modal capability.
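Since the DPO preference in (3) is literally "does the generated code compile", a binary preference signal of that kind can be computed in a few lines. The sketch below is my own illustration, not the project's training code:

```python
def code_compiles(code: str) -> bool:
    """Binary DPO preference signal (illustrative): True if the generated
    Canvas code is syntactically valid Python, False otherwise."""
    try:
        compile(code, '<generated-canvas-code>', 'exec')
        return True
    except SyntaxError:
        return False

# Given two sampled completions for the same prompt, the one that
# compiles would be treated as "chosen" and the other as "rejected".
```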
Some notes:
1. The recommended quant for `omost-llama-3-8b` is 4 bits, and for `omost-phi-3-mini-128k` (3.8B) it is 8 bits. They all fit in 8GB VRAM without offloading (see the loading sketch after this list). The performance degradation caused by quantization is minimal, and I personally never observed any evidence of degradation. However, quantizing `omost-phi-3-mini-128k` to 4 bits is not recommended, since I noticed some obvious performance degradation. The 4-bit inference of `omost-phi-3-mini-128k` should be viewed as a last resort in extreme cases when you really do not have a more capable GPU.
2. My user study shows that `omost-llama-3-8b-4bits` > `omost-dolphin-2.9-llama3-8b-4bits` > `omost-phi-3-mini-128k-8bits`, so in most cases one should just use `omost-llama-3-8b-4bits`.
3. `omost-llama-3-8b` and `omost-phi-3-mini-128k` are trained with filtered safe data, without NSFW or inappropriate content. See (4) if you need a different option.
4. `omost-dolphin-2.9-llama3-8b` is trained with all data WITHOUT any filtering. You must apply your own safety alignment methods if you expose any service based on `omost-dolphin-2.9-llama3-8b` to the public.
5. Note that the filtering in (3) is not because of any policy; the reason is that I noticed slight instability in the training gradients of those models, since they are pretrained with instruction following regulated by safety alignment, causing the performance to degrade a bit. The instruction following of `omost-dolphin-2.9-llama3-8b` is pretrained with community efforts and does not have this problem.
6. The 128k context length of `omost-phi-3-mini-128k` cannot be trusted. Its performance degrades a lot after the tokens reach about 8k, so one should just view it as a model with about 8k context length.
7. A model with 8k context length can do about 5 to 6 rounds of conversational editing. If you are about to run out of tokens, use the UI to modify your message and respond again (this can be done infinitely many times).
8. All models are fully trained on our H100 clusters at fp16 precision, without any tricks like quantization or Q-LoRA. The optimizer is Adam, also without any tricks.
9. You must also follow the licenses of Llama 3 and Phi-3.
10. You can request us to train on other LLMs if reasonable and necessary.
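As a companion to note (1), here is a minimal sketch of loading a model at 4 bits with Hugging Face `transformers` and `bitsandbytes`. The repo id is an assumption derived from the model names above, and the project may instead ship pre-quantized weights; treat this as one way to reproduce the recommended setting, not as the official loading code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed Hugging Face repo id, inferred from the model name above.
repo = 'lllyasviel/omost-llama-3-8b'

# 4-bit NF4 quantization; the quantized 8B model fits in 8GB VRAM
# without offloading, per note (1).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    quantization_config=quant_config,
    device_map='auto',
)
```

For `omost-phi-3-mini-128k`, the same sketch would use `load_in_8bit=True` instead, matching the 8-bit recommendation in note (1).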
# Cite
```bibtex
@Misc{omost,
  author = {Omost Team},
  title  = {Omost GitHub Page},
  year   = {2024},
}
```
# Related Work
Also read …
- DOCCI: Descriptions of Connected and Contrasting Images
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
- Self-correcting LLM-controlled Diffusion Models
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation