tencent/hunyuan-image-2.1

Generate high-quality 2K resolution images from text prompts

HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation

Overview

HunyuanImage-2.1 is a highly efficient text-to-image model capable of generating 2K-resolution images (for example, 2048 × 2048 at a 1:1 aspect ratio). Developed by Tencent, it combines an advanced architecture with training techniques designed for high image quality and faithful text alignment.
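For reference, below is a minimal sketch of calling this hosted model with the Replicate Python client. The input field names used here ("prompt", "aspect_ratio") are assumptions; check the model's API schema for the exact fields it accepts.

import replicate

# Minimal sketch: run the hosted model via the Replicate Python client.
# The input keys below are assumed names, not confirmed by this page.
output = replicate.run(
    "tencent/hunyuan-image-2.1",
    input={
        "prompt": "A lighthouse on a cliff at dusk, cinematic composition, ultra-detailed",
        "aspect_ratio": "16:9",  # assumed parameter; see Supported Resolutions below
    },
)
print(output)  # output format depends on the model, e.g. URL(s) to the generated image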

Key Features

  • High-Quality Generation: Produces ultra-high-definition (2K) images with cinematic composition
  • Multilingual Support: Native support for both Chinese and English prompts
  • Advanced Architecture: Multi-modal, single- and dual-stream DiT (Diffusion Transformer) backbone
  • Glyph-Aware Processing: ByT5 text rendering for improved text generation accuracy
  • Flexible Aspect Ratios: Supports 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3 ratios
  • Prompt Enhancement: Automatic prompt rewriting for better image quality

Architecture

The model consists of two main stages:

Stage 1: Base Text-to-Image Model

  • Dual Text Encoders:
      • Multimodal large language model (MLLM) for improved image-text alignment
      • Multi-language, character-aware encoder for enhanced text rendering
  • Network: Single- and dual-stream diffusion transformer with 17 billion parameters
  • Optimization: Reinforcement Learning from Human Feedback (RLHF) for aesthetics and structural coherence

Stage 2: Refiner Model

  • Enhances image quality and clarity
  • Minimizes artifacts
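The two stages compose as a simple generate-then-refine pipeline. The sketch below uses stub functions purely to illustrate the data flow; the real stages are the 17B DiT base model and the refiner from the repository.

import numpy as np

def base_generate(prompt: str, width: int = 2048, height: int = 2048) -> np.ndarray:
    # Stub standing in for the Stage-1 base model (17B single-/dual-stream DiT
    # with dual text encoders); returns a random placeholder image.
    return np.random.rand(height, width, 3)

def refine(image: np.ndarray) -> np.ndarray:
    # Stub standing in for the Stage-2 refiner (clarity enhancement, artifact reduction).
    return np.clip(image, 0.0, 1.0)

image = refine(base_generate("a misty mountain village at dawn"))
print(image.shape)  # (2048, 2048, 3)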

Technical Components

High-Compression VAE

  • 32× compression rate reduces the number of input tokens for the DiT model
  • Feature space aligned with DINOv2 features for more efficient training
  • Generates 2K images with the same token length as other models' 1K images
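A quick back-of-the-envelope check of the token-length claim above, assuming 32× spatial downsampling for this VAE and, for the comparison point, a typical 8× VAE combined with 2×2 patchification:

def latent_tokens(resolution: int, downsample: int, patch: int = 1) -> int:
    # Number of latent tokens for a square image after VAE downsampling
    # and optional patchification.
    side = resolution // (downsample * patch)
    return side * side

print(latent_tokens(2048, 32))          # 4096 tokens for a 2K image with a 32x VAE
print(latent_tokens(1024, 8, patch=2))  # 4096 tokens for a 1K image with an 8x VAE + 2x2 patches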

Training Innovations

  • Structured Captions: Hierarchical semantic information at multiple levels
  • OCR Agent and IP RAG: Address shortcomings of VLM captioners
  • REPA Training: Multi-bucket, multi-resolution loss for faster convergence
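The sketch below illustrates a REPA-style representation-alignment loss in its generic published form (project intermediate DiT features and maximize cosine similarity with frozen DINOv2 patch features); the exact formulation and multi-bucket, multi-resolution weighting used for HunyuanImage-2.1 may differ.

import torch
import torch.nn.functional as F

def repa_loss(dit_hidden: torch.Tensor, dino_feats: torch.Tensor,
              proj: torch.nn.Module) -> torch.Tensor:
    # dit_hidden: (B, N, d_model) intermediate DiT tokens
    # dino_feats: (B, N, d_dino) frozen DINOv2 patch features for the same images
    pred = proj(dit_hidden)  # map DiT tokens into the DINOv2 feature space
    return 1.0 - F.cosine_similarity(pred, dino_feats, dim=-1).mean()

# Example with random tensors (shapes are illustrative only):
proj = torch.nn.Linear(1024, 768)
loss = repa_loss(torch.randn(2, 4096, 1024), torch.randn(2, 4096, 768), proj)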

PromptEnhancer Module

  • The first systematic, industrial-grade prompt-rewriting model
  • Supports prompt rewriting in both Chinese and English
  • Uses a fine-grained semantic AlignEvaluator with 6 categories and 24 assessment points

Model Distillation

  • Novel meanflow-based distillation method
  • Enables high-quality generation with only a few sampling steps
  • The first successful application of meanflow at industrial scale
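As a rough illustration of why meanflow supports few-step sampling, the sketch below assumes the standard meanflow formulation, in which the network predicts the average velocity u(z_t, r, t) over an interval, so one update jumps directly from time t to time r. The actual distilled sampler in HunyuanImage-2.1 may differ in details.

import torch

@torch.no_grad()
def meanflow_sample(u_net, shape, steps: int = 4, device: str = "cpu") -> torch.Tensor:
    z = torch.randn(shape, device=device)        # start from pure noise at t = 1
    times = torch.linspace(1.0, 0.0, steps + 1)  # e.g. 4 jumps: 1.0 -> 0.75 -> ... -> 0.0
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u_net(z, r, t)         # z_r = z_t - (t - r) * u(z_t, r, t)
    return z                                     # approximate clean latents at t = 0

# Dummy average-velocity network for illustration only:
u_net = lambda z, r, t: torch.zeros_like(z)
latents = meanflow_sample(u_net, (1, 4, 64, 64), steps=4)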

System Requirements

Hardware

  • NVIDIA GPU with CUDA support
  • Minimum: 36 GB GPU memory for 2048×2048 generation
  • Linux operating system

Supported Resolutions

  • 16:9: 2560×1536
  • 4:3: 2304×1792
  • 1:1: 2048×2048
  • 3:4: 1792×2304
  • 9:16: 1536×2560
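For convenience, the same table as a lookup when constructing requests (values copied directly from the list above; the 3:2 and 2:3 ratios mentioned under Key Features have no resolutions listed here and are therefore omitted):

RESOLUTIONS = {
    "16:9": (2560, 1536),
    "4:3":  (2304, 1792),
    "1:1":  (2048, 2048),
    "3:4":  (1792, 2304),
    "9:16": (1536, 2560),
}
width, height = RESOLUTIONS["16:9"]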

Performance

SSAE (Structured Semantic Alignment Evaluation)

  • Mean Image Accuracy: 0.8888
  • Global Accuracy: 0.8832
  • Best performance among open-source models
  • Comparable to closed-source commercial models

GSB (Good/Same/Bad) Evaluation

  • -1.36% vs Seedream3.0 (closed-source)
  • +2.89% vs Qwen-Image (open-source)
  • Demonstrates competitive performance with commercial models

Citation

@misc{HunyuanImage-2.1,
  title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
  author={Tencent Hunyuan Team},
  year={2025},
  howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}