HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation
Overview
HunyuanImage-2.1 is a highly efficient text-to-image model capable of generating 2K resolution images (2048 × 2048). Developed by Tencent, it features advanced architecture and training techniques for superior image quality and text alignment.
Key Features
- High-Quality Generation: Produces ultra-high-definition (2K) images with cinematic composition
- Multilingual Support: Native support for both Chinese and English prompts
- Advanced Architecture: Multi-modal, single- and dual-stream DiT (Diffusion Transformer) backbone
- Glyph-Aware Processing: ByT5 text rendering for improved text generation accuracy
- Flexible Aspect Ratios: Supports 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3 ratios
- Prompt Enhancement: Automatic prompt rewriting for better image quality
Architecture
The model consists of two main stages:
Stage 1: Base Text-to-Image Model
- Dual Text Encoders:
  - A multimodal large language model (MLLM) for improved image-text alignment
  - A multilingual, character-aware encoder for enhanced text rendering
- Network: Single- and dual-stream diffusion transformer with 17 billion parameters
- Optimization: Reinforcement Learning from Human Feedback (RLHF) for aesthetics and structural coherence
Stage 2: Refiner Model
- Enhances image quality and clarity
- Minimizes artifacts
Technical Components
High-Compression VAE
- 32× compression ratio reduces the number of input tokens to the DiT
- Latents aligned with DINOv2 features for efficient training
- Generates 2K images with the same token budget other models spend on 1K images
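The token savings can be checked with quick arithmetic: a 32× VAE turns a 2048×2048 image into a 64×64 latent grid (4,096 positions), the same count a conventional 16× VAE produces for a 1024×1024 image. A minimal sketch, assuming one token per latent position (no extra patchification):

```python
def dit_tokens(resolution: int, vae_factor: int) -> int:
    """Latent token count for a square image, assuming one token
    per latent position (no additional patchification)."""
    side = resolution // vae_factor
    return side * side

# HunyuanImage-2.1: 32x VAE on a 2K image
tokens_2k = dit_tokens(2048, 32)   # 64 * 64 = 4096
# A conventional 16x VAE on a 1K image
tokens_1k = dit_tokens(1024, 16)   # 64 * 64 = 4096
print(tokens_2k, tokens_1k)        # 4096 4096
```

Since attention cost grows quadratically with token count, holding the token budget flat while quadrupling pixel count is where most of the efficiency claim comes from.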
Training Innovations
- Structured Captions: Hierarchical semantic information at multiple levels
- OCR Agent and IP RAG: Address shortcomings of the general VLM captioner
- REPA Training: Multi-bucket, multi-resolution loss for faster convergence
PromptEnhancer Module
- The first systematic, industrial-grade prompt-rewriting model
- Supports Chinese and English rewriting
- Uses fine-grained semantic AlignEvaluator with 6 categories and 24 assessment points
Model Distillation
- Novel meanflow-based distillation method
- Enables high-quality generation with few sampling steps
- The first successful application of meanflow distillation at industrial scale
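The idea behind meanflow is to learn the average velocity over an interval rather than the instantaneous velocity at a point, so one network call can take a large sampling step. A toy 1-D sketch under a straight-line (rectified-flow-style) path; the function names and the constant-velocity stand-in for the learned network are illustrative assumptions, not the actual implementation:

```python
def mean_velocity(z_t: float, r: float, t: float) -> float:
    # Toy stand-in for the learned average-velocity network u(z_t, r, t).
    # On a straight-line path z_t = (1 - t) * x + t * noise, the true
    # velocity is constant (noise - x), so its average over [r, t] is
    # that same value and a single big step is exact.
    noise, x = 1.0, -1.0          # fixed endpoints for this toy example
    return noise - x

def meanflow_step(z_t: float, r: float, t: float) -> float:
    # One sampling step: jump from time t down to time r in one call.
    return z_t - (t - r) * mean_velocity(z_t, r, t)

z1 = 1.0                                  # pure noise at t = 1
x0 = meanflow_step(z1, r=0.0, t=1.0)      # one-step generation
print(x0)                                 # -1.0, the data endpoint
```

With a curved real flow the average velocity differs from the instantaneous one, which is exactly what the distilled network has to learn; the few-step regime then interpolates between this one-jump extreme and ordinary many-step sampling.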
System Requirements
Hardware
- NVIDIA GPU with CUDA support
- Minimum: 36 GB GPU memory for 2048×2048 generation
- Linux operating system
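The 36 GB minimum is roughly consistent with a back-of-the-envelope estimate for the 17B-parameter DiT alone: assuming bf16 weights (2 bytes per parameter, our assumption, not a figure from the model card), the raw weights occupy about 34 GB before activations, the VAE, and the text encoders are counted:

```python
params = 17e9          # DiT parameter count stated above
bytes_per_param = 2    # bf16 / fp16: 2 bytes per weight (assumption)
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB")   # 34.0 GB of raw weights
```

In practice, offloading the text encoders and refiner to CPU, or quantizing weights, is the usual way to fit such a model into smaller GPUs.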
Supported Resolutions
- 16:9: 2560×1536
- 4:3: 2304×1792
- 1:1: 2048×2048
- 3:4: 1792×2304
- 9:16: 1536×2560
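A requested aspect ratio can be snapped to the nearest supported resolution with a small helper; the names below are ours, not part of the official API:

```python
# Supported (width, height) pairs from the table above.
RESOLUTIONS = {
    "16:9": (2560, 1536),
    "4:3":  (2304, 1792),
    "1:1":  (2048, 2048),
    "3:4":  (1792, 2304),
    "9:16": (1536, 2560),
}

def pick_resolution(width_hint: int, height_hint: int) -> tuple[int, int]:
    """Return the supported resolution whose aspect ratio is
    closest to the requested width:height ratio."""
    target = width_hint / height_hint
    return min(RESOLUTIONS.values(),
               key=lambda wh: abs(wh[0] / wh[1] - target))

print(pick_resolution(1920, 1080))   # (2560, 1536)
```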
Performance
SSAE Evaluation
- Mean Image Accuracy: 0.8888
- Global Accuracy: 0.8832
- Best performance among open-source models
- Comparable to closed-source commercial models
GSB Evaluation
- -1.36% vs Seedream3.0 (closed-source)
- +2.89% vs Qwen-Image (open-source)
- Demonstrates competitive performance with commercial models
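GSB percentages of this kind are conventionally the normalized margin between "Good" and "Bad" votes in a side-by-side human comparison; a sketch of that computation, with vote counts invented purely to illustrate the formula:

```python
def gsb_score(good: int, same: int, bad: int) -> float:
    """Relative preference: (Good - Bad) / total votes, in percent.
    Positive means the evaluated model is preferred over the baseline;
    negative means the baseline is preferred."""
    total = good + same + bad
    return 100.0 * (good - bad) / total

# Invented vote counts, only to show how the metric behaves.
print(f"{gsb_score(300, 458, 242):+.2f}%")
```

Under this reading, -1.36% against Seedream3.0 means near-parity with a slight preference for the baseline, while +2.89% over Qwen-Image means a modest preference for HunyuanImage-2.1.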
Links
- Code: GitHub Repository (https://github.com/Tencent-Hunyuan/HunyuanImage-2.1)
- Demo: HuggingFace Space
- Models: HuggingFace Hub
- PromptEnhancer: Project Page
Citation
@misc{HunyuanImage-2.1,
title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
author={Tencent Hunyuan Team},
year={2025},
howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}