Readme
๐จ HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
๐งฉ Community Contributions
If you develop/use HunyuanImage-3.0 in your projects, welcome to let us know.
๐๏ธ Contents
- ๐ฅ๐ฅ๐ฅ News
- ๐งฉ Community Contributions
- ๐ Open-source Plan
- ๐ Introduction
- โจ Key Features
- ๐ ๏ธ Dependencies and Installation
- ๐ป System Requirements
- ๐ฆ Environment Setup
- ๐ฅ Install Dependencies
- Performance Optimizations
- ๐ Usage
- ๐ฅ Quick Start with Transformers
- ๐ Local Installation & Usage
- ๐จ Interactive Gradio Demo
- ๐งฑ Models Cards
- ๐ Prompt Guide
- Manually Writing Prompts
- System Prompt For Automatic Rewriting the Prompt
- Advanced Tips
- More Cases
- ๐ Evaluation
- ๐ Citation
- ๐ Acknowledgements
- ๐๐ Github Star History
๐ Introduction
HunyuanImage-3.0 is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image module achieves performance comparable to or surpassing leading closed-source models.
โจ Key Features
-
๐ง Unified Multimodal Architecture: Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.
-
๐ The Largest Image Generation MoE Model: This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
-
๐จ Superior Image Generation Performance: Through rigorous dataset curation and advanced reinforcement learning post-training, weโve achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.
-
๐ญ Intelligent World-Knowledge Reasoning: The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It leverages its extensive world knowledge to intelligently interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.
๐ ๏ธ Dependencies and Installation
๐ป System Requirements
- ๐ฅ๏ธ Operating System: Linux
- ๐ฎ GPU: NVIDIA GPU with CUDA support
- ๐พ Disk Space: 170GB for model weights
- ๐ง GPU Memory: โฅ3ร80GB (4ร80GB recommended for better performance)
๐ฆ Environment Setup
- ๐ Python: 3.12+ (recommended and tested)
- ๐ฅ PyTorch: 2.7.1
- โก CUDA: 12.8
๐งฑ Models Cards
Model | Params | Download | Recommended VRAM | Supported |
---|---|---|---|---|
HunyuanImage-3.0 | 80B total (13B active) | HuggingFace | โฅ 3 ร 80 GB | โ Text-to-Image |
HunyuanImage-3.0-Instruct | 80B total (13B active) | HuggingFace | โฅ 3 ร 80 GB | โ
Text-to-Image โ Prompt Self-Rewrite โ CoT Think |
Notes: - Install performance extras (FlashAttention, FlashInfer) for faster inference. - MultiโGPU inference is recommended for the Base model.
๐ Prompt Guide
Manually Writing Prompts.
The Pretrain Checkpoint does not automatically rewrite or enhance input prompts, Instruct Checkpoint can rewrite or enhance input prompts with thinking . For optimal results currently, we recommend community partners consulting our official guide on how to write effective prompts.
Reference: HunyuanImage 3.0 Prompt Handbook
System Prompt For Automatic Rewriting the Prompt.
Weโve included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:
- system_prompt_universal: This system prompt converts photographic style, artistic prompts into a detailed one.
- system_prompt_text_rendering: This system prompt converts UI/Poster/Text Rending prompts to a deailed on that suits the model.
Note that these system prompts are in Chinese because Deepseek works better with Chinese system prompts. If you want to use it for English oriented model, you may translate it into English or refer to the comments in the PE file as a guide.
We also create a Yuanqi workflow to implement the universal one, you can directly try it.
Advanced Tips
-
Content Priority: Focus on describing the main subject and action first, followed by details about the environment and style. A more general description framework is: Main subject and scene + Image quality and style + Composition and perspective + Lighting and atmosphere + Technical parameters. Keywords can be added both before and after this structure.
-
Image resolution: Our model not only supports multiple resolutions but also offers both automatic and specified resolution options. In auto mode, the model automatically predicts the image resolution based on the input prompt. In specified mode (like traditional DiT), the model outputs an image resolution that strictly aligns with the userโs chosen resolution.
More Cases
Our model can follow complex instructions to generate highโquality, creative images.
Our model can effectively process very long text inputs, enabling users to precisely control the finer details of generated images. Extended prompts allow for intricate elements to be accurately captured, making it ideal for complex projects requiring precision and creativity.
![]() |
![]() |
๐ Evaluation
- ๐ค SSAE (Machine Evaluation)
SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used multimodal large language models to automatically evaluate and score by comparing the generated images with these key points based on the visual content of the images. Mean Image Accuracy represents the image-wise average score across all key points, while Global Accuracy directly calculates the average score across all key points.
- ๐ฅ GSB (Human Evaluation)
We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators.
๐ Citation
If you find HunyuanImage-3.0 useful in your research, please cite our work:
@article{cao2025hunyuanimage,
title={HunyuanImage 3.0 Technical Report},
author={Cao, Siyu and Chen, Hangting and Chen, Peng and Cheng, Yiji and Cui, Yutao and Deng, Xinchi and Dong, Ying and Gong, Kipper and Gu, Tianpeng and Gu, Xiusen and others},
journal={arXiv preprint arXiv:2509.23951},
year={2025}
}
๐ Acknowledgements
We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:
- ๐ค Transformers - State-of-the-art NLP library
- ๐จ Diffusers - Diffusion models library
- ๐ HuggingFace - AI model hub and community
- โก FlashAttention - Memory-efficient attention
- ๐ FlashInfer - Optimized inference engine