chenxwh / textdiffuser

Diffusion Models as Text Painters

  • Public
  • 1.7K runs
  • GitHub
  • Paper
  • License



Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 10 minutes. The predict time for this model varies significantly based on the inputs.
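Since predictions run on Replicate, they can be triggered programmatically. The following is a minimal sketch using the Replicate Python client; the input field name (`prompt`) is illustrative and may differ from this model's actual input schema, and the call is guarded so it only runs when an API token is configured.

```python
# Minimal sketch of calling this model through the Replicate Python client.
# Assumptions: the `replicate` package is installed (pip install replicate)
# and REPLICATE_API_TOKEN is set; the "prompt" field is a hypothetical
# input name, not confirmed against the model's schema.
import os

payload = {
    "prompt": 'A storefront sign that reads "TextDiffuser"',  # text to render
}

if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    # replicate.run blocks until the prediction completes and returns the output
    output = replicate.run("chenxwh/textdiffuser", input=payload)
    print(output)
```

Because predictions can take on the order of ten minutes on a T4, a long client-side timeout (or Replicate's asynchronous prediction API) is advisable.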


TextDiffuser: Diffusion Models as Text Painters

TextDiffuser generates images containing visually appealing text that is coherent with the background. It is flexible and controllable: it can create high-quality text images from text prompts alone or combined with text template images, and it can perform text inpainting to reconstruct incomplete images containing text.


  • We propose TextDiffuser, a two-stage diffusion-based framework for text rendering. It generates accurate and coherent text images from text prompts, optionally guided by template images, and can also perform text inpainting to reconstruct incomplete images.

  • We release MARIO-10M, a large-scale dataset of image-text pairs with OCR annotations, including text recognition, detection, and character-level segmentation masks. (To be released)


We sincerely thank the following projects: Hugging Face Diffusers, LAION, DB, PARSeq, img2dataset.

Also, special thanks to the following open-source diffusion projects and available demos: DALLE, Stable Diffusion, Stable Diffusion XL, Midjourney, ControlNet, DeepFloyd.


For help or issues using TextDiffuser, please email Jingye Chen or Yupan Huang, or submit a GitHub issue.

For other communications related to TextDiffuser, please contact Lei Cui or Furu Wei.


If you find this code useful in your research, please consider citing:

@article{chen2023textdiffuser,
  title={TextDiffuser: Diffusion Models as Text Painters},
  author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
  journal={arXiv preprint arXiv:2305.10855},
  year={2023}
}