tencent/hunyuan-image-2.1

Generate high-quality 2K resolution images from text prompts

HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation

Overview

HunyuanImage-2.1 is a highly efficient text-to-image model capable of generating 2K-resolution images (for example, 2048 × 2048 at a 1:1 aspect ratio). Developed by Tencent, it combines an advanced architecture with training techniques designed for high image quality and faithful text alignment.
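For reference, below is a minimal sketch of calling this hosted model with the Replicate Python client. The input field names used here ("prompt", "aspect_ratio") are assumptions; check the model's API schema for the exact fields it accepts.

import replicate

# Minimal sketch: run the hosted model via the Replicate Python client.
# The input keys below are assumed names, not confirmed by this page.
output = replicate.run(
    "tencent/hunyuan-image-2.1",
    input={
        "prompt": "A lighthouse on a cliff at dusk, cinematic composition, ultra-detailed",
        "aspect_ratio": "16:9",  # assumed parameter; see Supported Resolutions below
    },
)
print(output)  # output format depends on the model, e.g. URL(s) to the generated image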

Key Features

  • High-Quality Generation: Produces ultra-high-definition (2K) images with cinematic composition
  • Multilingual Support: Native support for both Chinese and English prompts
  • Advanced Architecture: Multi-modal, single- and dual-stream DiT (Diffusion Transformer) backbone
  • Glyph-Aware Processing: ByT5 text rendering for improved text generation accuracy
  • Flexible Aspect Ratios: Supports 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3 ratios
  • Prompt Enhancement: Automatic prompt rewriting for better image quality

Architecture

The model consists of two main stages:

Stage 1: Base Text-to-Image Model

  • Dual Text Encoders:
      • Multimodal large language model (MLLM) for improved image-text alignment
      • Multi-language, character-aware encoder for enhanced text rendering
  • Network: Single- and dual-stream diffusion transformer with 17 billion parameters
  • Optimization: Reinforcement Learning from Human Feedback (RLHF) for aesthetics and structural coherence

Stage 2: Refiner Model

  • Enhances image quality and clarity
  • Minimizes artifacts
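The two stages compose as a simple generate-then-refine pipeline. The sketch below uses stub functions purely to illustrate the data flow; the real stages are the 17B DiT base model and the refiner from the repository.

import numpy as np

def base_generate(prompt: str, width: int = 2048, height: int = 2048) -> np.ndarray:
    # Stub standing in for the Stage-1 base model (17B single-/dual-stream DiT
    # with dual text encoders); returns a random placeholder image.
    return np.random.rand(height, width, 3)

def refine(image: np.ndarray) -> np.ndarray:
    # Stub standing in for the Stage-2 refiner (clarity enhancement, artifact reduction).
    return np.clip(image, 0.0, 1.0)

image = refine(base_generate("a misty mountain village at dawn"))
print(image.shape)  # (2048, 2048, 3)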

Technical Components

High-Compression VAE

  • 32× compression rate reduces the number of input tokens for the DiT model
  • Feature space aligned with DINOv2 features for more efficient training
  • Generates 2K images with the same token length as other models' 1K images
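A quick back-of-the-envelope check of the token-length claim above, assuming 32× spatial downsampling for this VAE and, for the comparison point, a typical 8× VAE combined with 2×2 patchification:

def latent_tokens(resolution: int, downsample: int, patch: int = 1) -> int:
    # Number of latent tokens for a square image after VAE downsampling
    # and optional patchification.
    side = resolution // (downsample * patch)
    return side * side

print(latent_tokens(2048, 32))          # 4096 tokens for a 2K image with a 32x VAE
print(latent_tokens(1024, 8, patch=2))  # 4096 tokens for a 1K image with an 8x VAE + 2x2 patches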

Training Innovations

  • Structured Captions: Hierarchical semantic information at multiple levels
  • OCR Agent and IP RAG: Address shortcomings of VLM captioners
  • REPA Training: Multi-bucket, multi-resolution loss for faster convergence
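The sketch below illustrates a REPA-style representation-alignment loss in its generic published form (project intermediate DiT features and maximize cosine similarity with frozen DINOv2 patch features); the exact formulation and multi-bucket, multi-resolution weighting used for HunyuanImage-2.1 may differ.

import torch
import torch.nn.functional as F

def repa_loss(dit_hidden: torch.Tensor, dino_feats: torch.Tensor,
              proj: torch.nn.Module) -> torch.Tensor:
    # dit_hidden: (B, N, d_model) intermediate DiT tokens
    # dino_feats: (B, N, d_dino) frozen DINOv2 patch features for the same images
    pred = proj(dit_hidden)  # map DiT tokens into the DINOv2 feature space
    return 1.0 - F.cosine_similarity(pred, dino_feats, dim=-1).mean()

# Example with random tensors (shapes are illustrative only):
proj = torch.nn.Linear(1024, 768)
loss = repa_loss(torch.randn(2, 4096, 1024), torch.randn(2, 4096, 768), proj)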

PromptEnhancer Module

  • The first systematic, industrial-grade prompt-rewriting model
  • Supports prompt rewriting in both Chinese and English
  • Uses a fine-grained semantic AlignEvaluator with 6 categories and 24 assessment points

Model Distillation

  • Novel meanflow-based distillation method
  • Enables high-quality generation with only a few sampling steps
  • The first successful application of meanflow at industrial scale
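As a rough illustration of why meanflow supports few-step sampling, the sketch below assumes the standard meanflow formulation, in which the network predicts the average velocity u(z_t, r, t) over an interval, so one update jumps directly from time t to time r. The actual distilled sampler in HunyuanImage-2.1 may differ in details.

import torch

@torch.no_grad()
def meanflow_sample(u_net, shape, steps: int = 4, device: str = "cpu") -> torch.Tensor:
    z = torch.randn(shape, device=device)        # start from pure noise at t = 1
    times = torch.linspace(1.0, 0.0, steps + 1)  # e.g. 4 jumps: 1.0 -> 0.75 -> ... -> 0.0
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u_net(z, r, t)         # z_r = z_t - (t - r) * u(z_t, r, t)
    return z                                     # approximate clean latents at t = 0

# Dummy average-velocity network for illustration only:
u_net = lambda z, r, t: torch.zeros_like(z)
latents = meanflow_sample(u_net, (1, 4, 64, 64), steps=4)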

System Requirements

Hardware

  • NVIDIA GPU with CUDA support
  • Minimum: 36 GB GPU memory for 2048×2048 generation
  • Linux operating system

Supported Resolutions

  • 16:9: 2560×1536
  • 4:3: 2304×1792
  • 1:1: 2048×2048
  • 3:4: 1792×2304
  • 9:16: 1536×2560
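For convenience, the same table as a lookup when constructing requests (values copied directly from the list above; the 3:2 and 2:3 ratios mentioned under Key Features have no resolutions listed here and are therefore omitted):

RESOLUTIONS = {
    "16:9": (2560, 1536),
    "4:3":  (2304, 1792),
    "1:1":  (2048, 2048),
    "3:4":  (1792, 2304),
    "9:16": (1536, 2560),
}
width, height = RESOLUTIONS["16:9"]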

Performance

SSAE (Structured Semantic Alignment Evaluation)

  • Mean Image Accuracy: 0.8888
  • Global Accuracy: 0.8832
  • Best performance among open-source models
  • Comparable to closed-source commercial models

GSB (Good/Same/Bad) Evaluation

  • -1.36% vs Seedream3.0 (closed-source)
  • +2.89% vs Qwen-Image (open-source)
  • Demonstrates competitive performance with commercial models

Citation

@misc{HunyuanImage-2.1,
  title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
  author={Tencent Hunyuan Team},
  year={2025},
  howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}