nvidia / sana

A fast image model with wide artistic range and resolutions up to 4096x4096

  • Public
  • 127.5K runs
  • H100
  • GitHub
  • Weights
  • Paper
  • License

Input

string

Input prompt

Default: "a cyberpunk cat with a neon sign that says \"Sana\""

string

Specify things that should not appear in the output

Default: ""

string

Model variant. 1600M variants are slower but produce higher quality than 600M variants; 1024px variants are optimized for 1024x1024px images, while 512px variants are optimized for 512x512px images. 'multilang' variants can be prompted in both English and Chinese.

Default: "1600M-1024px"

integer

Width of output image

Default: 1024

integer

Height of output image

Default: 1024

integer
(minimum: 1)

Number of denoising steps

Default: 18

number
(minimum: 1, maximum: 20)

Classifier-free guidance scale

Default: 5

number
(minimum: 1, maximum: 20)

PAG (Perturbed-Attention Guidance) scale

Default: 2

integer

Random seed. Leave blank to randomize the seed
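
The inputs above map onto Replicate's API. A minimal sketch using the Replicate Python client follows; the field names in the dict mirror the schema above but are assumptions, so check the model's API tab for the exact names:

```python
# Hypothetical input dict for nvidia/sana; field names are assumed
# from the schema on this page, not confirmed against the API.
inputs = {
    "prompt": 'a cyberpunk cat with a neon sign that says "Sana"',
    "negative_prompt": "",
    "model_variant": "1600M-1024px",
    "width": 1024,
    "height": 1024,
    "num_inference_steps": 18,
    "guidance_scale": 5,
    "pag_guidance_scale": 2,
}

# Requires `pip install replicate` and REPLICATE_API_TOKEN in the environment:
# import replicate
# output = replicate.run("nvidia/sana", input=inputs)
```

Leaving `seed` out of the dict corresponds to a randomized seed, per the schema above.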

Output


This output was created using a different version of the model, nvidia/sana:88312dcb.

Run time and cost

This model costs approximately $0.0097 to run on Replicate, or 103 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia H100 GPU hardware. Predictions typically complete within 7 seconds. The predict time for this model varies significantly based on the inputs.

Readme

⚡️Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

[Teaser figure, page 1]

💡 Introduction

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at remarkably fast speed, and is deployable on a laptop GPU. Core designs include:

(1) DC-AE: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
(2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
(3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment.
(4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
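
To make design (1) concrete, here is a back-of-the-envelope sketch (patchification and channel details are ignored): a 32× autoencoder turns a 4096 × 4096 image into a 128 × 128 latent grid, 16× fewer spatial tokens than the traditional 8× compression.

```python
def latent_tokens(image_side: int, downsample: int) -> int:
    """Number of spatial latent tokens for a square image,
    ignoring any further patchification inside the transformer."""
    side = image_side // downsample
    return side * side

# Traditional 8x AE vs. a 32x AE like DC-AE, at 4096x4096:
tokens_8x = latent_tokens(4096, 8)    # 512 * 512 = 262,144 tokens
tokens_32x = latent_tokens(4096, 32)  # 128 * 128 = 16,384 tokens
print(tokens_8x // tokens_32x)        # 16x fewer tokens to attend over
```

Since attention cost scales with token count, this reduction compounds with the linear attention of design (2) at high resolutions.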

As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 image. Sana enables content creation at low cost.
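
The linear attention in design (2) can be sketched as follows. This is an illustrative NumPy toy, not Sana's exact kernel; the ReLU feature map is an assumption. The key point is that associating `K^T V` first makes the cost linear in sequence length N rather than quadratic:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Toy linear attention: O(N * d * d_v) instead of O(N^2 * d).

    Q: (N, d), K: (N, d), V: (N, d_v). The ReLU feature map keeps
    the normalizer positive; real models may use other kernels.
    """
    Qf, Kf = np.maximum(Q, 0), np.maximum(K, 0)
    KV = Kf.T @ V                   # (d, d_v): the N x N matrix is never formed
    Z = Qf @ Kf.sum(axis=0) + eps   # (N,): per-query normalizer
    return (Qf @ KV) / Z[:, None]   # (N, d_v)
```

When all keys are identical, each query attends uniformly, so the output collapses to the mean of V; this is a handy sanity check on the normalization.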

[Teaser figure, page 2]

🤗Acknowledgements

📖BibTeX

@misc{xie2024sana,
      title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
      author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
      year={2024},
      eprint={2410.10629},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.10629},
}