Qwen-Image H100 Optimized Build
High-performance text-to-image generation with direct HuggingFace loading, optimized for high-volume API usage on H100 GPUs.
⚠️ IMPORTANT: Cost Warning
This deployment loads official HuggingFace weights with BF16 precision for maximum stability.
- Cold boot time: ~60-90 seconds (~$0.092 on H100)
- Lightning mode: Cost-effective generation at ~$0.005 per image.
- Standard mode: Optimized for high-quality generation.
- With LoRAs: Additional time for first-time download/loading.
⚠️ Pricing Disclaimer: All cost calculations are based on Replicate pricing as of November 10, 2025. Prices may change. Use these estimates with caution and verify current rates at https://replicate.com/docs/topics/billing
👉 For sporadic use or faster cold starts, consider alternative versions:
https://replicate.com/qwen/qwen-image - An alternative with pre-baked weights for faster startup.
Use THIS version ONLY if you specifically need:
- ✅ Official HuggingFace weights via API (guaranteed official model, not pre-baked versions)
- ✅ Direct HuggingFace model loading for transparency and control
- ✅ Custom LoRA loading from URLs/HuggingFace
- ✅ Lightning mode for high-throughput, cost-effective generation
- ✅ High-volume batch processing
Why This Build?
This build is designed for users who prioritize transparency, flexibility, and control.
- Loads official model directly from HuggingFace
- Guaranteed to use the exact weights from HF, not pre-baked custom versions
- Full transparency on model source and version
- Dynamic LoRA loading from any URL
- Optimized for sustained API usage with official weights
- 1-2 minutes cold start, then fast inference
Cost-Effectiveness Analysis
H100 GPU - $5.49/hour
Per-Image Cost Breakdown:
| Scenario | Time | Cost | Notes |
|---|---|---|---|
| Cold boot | ~60s | $0.092 | First image only (one-time) |
| Standard (50 steps) | ~34s | $0.052 | 1328×1328 at 1.45 it/s |
| Lightning (8 steps) | ~3.5s | $0.005 | 1328×1328 at 2.5 it/s |
| HD Standard (50 steps) | ~123s | $0.188 | 2048×2048 at 0.4 it/s |
| HD Lightning (8 steps) | ~10s | $0.015 | 2048×2048 at 0.8 it/s |
Session Economics (Standard Mode):
| Duration | Images Generated | Total Cost | Cost per Image |
|---|---|---|---|
| 1 hour | ~104 images | $5.49 | $0.053 |
| 2 hours | ~210 images | $10.98 | $0.052 |
| 4 hours | ~423 images | $21.96 | $0.052 |
Session Economics (Lightning Mode):
| Duration | Images Generated | Total Cost | Cost per Image |
|---|---|---|---|
| 1 hour | ~1,011 images | $5.49 | $0.0054 |
| 2 hours | ~2,057 images | $10.98 | $0.0053 |
| 4 hours | ~4,114 images | $21.96 | $0.0053 |
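As a sanity check, the session figures above can be approximated in a few lines of Python. This is a rough sketch using this document's own estimates (a one-time ~60s cold boot, then a fixed per-image time); longer sessions drift slightly from the table due to rounding in the source figures.

```python
# Approximate the session-economics tables above.
GPU_RATE_PER_HOUR = 5.49   # H100 on Replicate (as of Nov 10, 2025)
COLD_BOOT_S = 60           # one-time, first prediction only

def images_per_session(hours: float, seconds_per_image: float) -> int:
    """Images generated in a session after subtracting the cold boot."""
    return int((hours * 3600 - COLD_BOOT_S) / seconds_per_image)

def cost_per_image(hours: float, seconds_per_image: float) -> float:
    """Average cost per image when the GPU runs for the whole session."""
    return hours * GPU_RATE_PER_HOUR / images_per_session(hours, seconds_per_image)

print(images_per_session(1, 34))   # ~104 (Standard, 34 s/image)
print(images_per_session(1, 3.5))  # ~1011 (Lightning, 3.5 s/image)
```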
Break-Even Analysis
The one-time cold boot cost is amortized over the course of a session. The cost per image rapidly decreases after the first generation.
Standard Mode (50 steps):
- The cost per image becomes more economical as more images are generated in a single session.
- Best suited for larger batches where the initial cold boot cost is spread across many outputs.
Lightning Mode (8 steps):
- Offers significant cost savings for high-volume generation.
- The most cost-effective option for batch processing at any scale.
For users prioritizing the fastest possible cold start for single images, other versions with pre-warmed models may be more convenient.
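The amortization described above can be sketched numerically. The 60-second boot and per-image times are this document's estimates, not measured values:

```python
# Per-image cost as a function of batch size: the fixed cold-boot cost
# is spread over more images, so the average falls toward the pure
# inference cost (~$0.052 for Standard mode).
RATE = 5.49 / 3600  # H100 $/second

def per_image_cost(n_images: int, seconds_per_image: float,
                   cold_boot_s: float = 60) -> float:
    total = (cold_boot_s + n_images * seconds_per_image) * RATE
    return total / n_images

print(round(per_image_cost(1, 34), 3))    # ~0.143 (boot dominates)
print(round(per_image_cost(100, 34), 3))  # ~0.053 (boot amortized)
```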
Cost Calculator
Formula:
Total Cost = (Cold Boot + (Images × Inference Time)) × GPU Rate
GPU Rate: $0.001525/sec ($5.49/hour)
Example - 50 Images on H100 (Standard Mode):
Cold Boot: 60s × $0.001525/s = $0.092
Inference: 50 × 34s × $0.001525/s = $2.593
Total: $2.685
Per Image: $0.054
Example - 50 Images on H100 (Lightning Mode):
Cold Boot: 60s × $0.001525/s = $0.092
Inference: 50 × 3.5s × $0.001525/s = $0.267
Total: $0.359
Per Image: $0.007
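The formula above translates directly into code. A minimal sketch (the worked examples round each term individually, so totals may differ by a fraction of a cent):

```python
# Total Cost = (Cold Boot + Images * Inference Time) * GPU Rate
GPU_RATE = 0.001525  # $/second (H100 at $5.49/hour)

def session_cost(n_images: int, seconds_per_image: float,
                 cold_boot_s: float = 60) -> float:
    return (cold_boot_s + n_images * seconds_per_image) * GPU_RATE

standard = session_cost(50, 34)    # ≈ $2.68 total for 50 Standard images
lightning = session_cost(50, 3.5)  # ≈ $0.36 total for 50 Lightning images
```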
Note: All pricing examples above are based on Replicate rates as of November 10, 2025. Verify current pricing at https://replicate.com/docs/topics/billing
This build makes economic sense when:
- Using Lightning mode for cost-effective, high-volume generation.
- Running large batches in Standard mode.
- You need official HuggingFace weights/API access.
- LoRA flexibility and custom model loading are required.
API Parameters
Core Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `prompt` | string | required | - | Text description of image to generate |
| `negative_prompt` | string | "" | - | Elements to avoid in generation |
| `aspect_ratio` | choice | "1:1" | See below | Output image aspect ratio |
| `num_inference_steps` | int | 50 | 1-100 | Denoising steps (more = better quality, slower) |
| `true_cfg_scale` | float | 4.0 | 1.0-10.0 | Guidance scale (higher = closer to prompt) |
| `seed` | int | random | -1 or 0+ | Random seed (-1 for random) |
| `num_outputs` | int | 1 | 1-2 | Number of images to generate |
LoRA Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `lora_weights` | string | "" | - | URL to main LoRA weights (.safetensors) |
| `lora_scale` | float | 1.0 | 0.0-2.0 | Main LoRA strength |
| `extra_lora_weights` | string | "" | - | URL to additional LoRA weights |
| `extra_lora_scale` | float | 1.0 | 0.0-2.0 | Additional LoRA strength |
Speed & Output Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `go_fast` | bool | false | - | Enable Lightning mode (8 steps, 10× faster, good quality) |
| `output_format` | choice | "png" | png/jpg/webp | Output image format |
| `output_quality` | int | 95 | 80-100 | Quality for jpg/webp (higher = better) |
| `disable_safety_checker` | bool | false | - | Disable safety filtering |
Aspect Ratios
| Ratio | Dimensions | Best For |
|---|---|---|
| 1:1 | 1328×1328 | Social media, square posts |
| 16:9 | 1664×928 | Widescreen, landscapes |
| 9:16 | 928×1664 | Mobile, portrait stories |
| 4:3 | 1472×1104 | Traditional photos |
| 3:4 | 1104×1472 | Portrait photos |
| 3:2 | 1584×1056 | DSLR standard |
| 2:3 | 1056×1584 | Classic portrait |
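For client-side validation or post-processing, the table above can be mirrored as a simple lookup (an illustrative mapping, not part of this build's API):

```python
# Aspect ratio -> (width, height) in pixels, per the table above.
ASPECT_RATIOS = {
    "1:1":  (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3":  (1472, 1104),
    "3:4":  (1104, 1472),
    "3:2":  (1584, 1056),
    "2:3":  (1056, 1584),
}

width, height = ASPECT_RATIOS["16:9"]  # widescreen output dimensions
```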
Performance & Hardware
Technical Specifications
- Base Model: Qwen/Qwen-Image (20B parameters)
- Precision: BF16 (bfloat16) for numerical stability
- GPU Memory Required: 55-60GB
- Model Source: Direct HuggingFace Hub download
Why BF16?
BF16 prevents NaN/Inf errors in scheduler calculations that occur with FP16, providing stable numerical computations while maintaining fast inference speeds. This precision format is optimal for the Qwen-Image model architecture.
Performance Benchmarks (H100)
Standard Mode (50 steps):
- Model load: ~60-120s
- Warm-up: ~0.7s
- Inference (1328×1328): ~34s at 1.45 it/s
- Inference (2048×2048): ~123s at 0.4 it/s
Lightning Mode (8 steps, with go_fast=true):
- Inference (1328×1328): ~3.5s at 2.5 it/s (10× faster)
- Inference (2048×2048): ~10s at 0.8 it/s (12× faster)
H100 vs A100 Comparison
| Metric | H100 80GB | A100 80GB | H100 Advantage |
|---|---|---|---|
| Price | $5.49/hr | $5.04/hr | +9% cost |
| Cold Boot | ~60-120s | ~90-180s | ~33% faster |
| Inference (50 steps) | ~34s | ~45s | ~24% faster |
| Throughput | ~105 img/hr | ~78 img/hr | +35% output |
| Cost/Image (1hr) | $0.052 | $0.065 | 20% cheaper |
Why H100?
Despite being 9% more expensive per hour, H100 generates 35% more images, resulting in 20% lower cost per image with significantly faster inference.
Hardware Requirements
Supported:
- ✅ NVIDIA H100 80GB (recommended)
- ✅ NVIDIA A100 80GB (works, slower)
Not Supported:
- ❌ NVIDIA T4 (16GB) - Too small
- ❌ NVIDIA L40S (48GB) - Insufficient memory
LoRA Support
Loading LoRAs
Direct .safetensors files: pass a publicly accessible URL via the `lora_weights` (or `extra_lora_weights`) parameter.
Archive files (automatically extracted):
- ZIP, TAR.GZ, TAR.BZ2 formats supported
- Automatically finds .safetensors in archives
Features
- Smart Caching: Same URL reuses downloaded file (no re-download)
- Dual LoRA Support: Combine 2 LoRAs simultaneously with independent scales
- Automatic Extraction: Finds .safetensors in archives
- HuggingFace Support: Direct loading from HF repositories
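The smart-caching behavior can be illustrated with a URL-keyed cache. This is a sketch of the general technique, not this build's actual implementation; the cache directory and hashing scheme are assumptions:

```python
# Illustrative URL-keyed LoRA cache: the same URL always maps to the
# same local path, so repeat requests skip the download.
import hashlib
from pathlib import Path

CACHE_DIR = Path("/tmp/lora-cache")  # assumed location

def cache_path(url: str) -> Path:
    """Deterministic local path derived from the LoRA URL."""
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return CACHE_DIR / f"{digest}.safetensors"

def is_cached(url: str) -> bool:
    return cache_path(url).exists()
```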
Usage Examples
Single LoRA - Style Transfer:
```json
{
  "prompt": "Woman portrait",
  "lora_weights": "https://hf.co/user/style/resolve/main/lora.safetensors",
  "lora_scale": 0.9
}
```
Dual LoRA - Concept + Style:
```json
{
  "prompt": "Futuristic vehicle",
  "lora_weights": "https://hf.co/user/concept/resolve/main/lora.safetensors",
  "lora_scale": 0.8,
  "extra_lora_weights": "https://hf.co/user/style/resolve/main/lora.safetensors",
  "extra_lora_scale": 0.6
}
```
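The dual-LoRA request above can be assembled programmatically before calling the API. `build_input` is a hypothetical helper, and the model slug in the final comment is a placeholder:

```python
# Hypothetical helper that assembles a prediction input for this model.
def build_input(prompt, lora_url=None, lora_scale=1.0,
                extra_url=None, extra_scale=1.0):
    payload = {"prompt": prompt}
    if lora_url:
        payload["lora_weights"] = lora_url
        payload["lora_scale"] = lora_scale
    if extra_url:
        payload["extra_lora_weights"] = extra_url
        payload["extra_lora_scale"] = extra_scale
    return payload

inputs = build_input(
    "Futuristic vehicle",
    lora_url="https://hf.co/user/concept/resolve/main/lora.safetensors",
    lora_scale=0.8,
    extra_url="https://hf.co/user/style/resolve/main/lora.safetensors",
    extra_scale=0.6,
)
# With the Replicate Python client (pip install replicate), assuming a slug:
# output = replicate.run("<owner>/<model>", input=inputs)
```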
Lightning Mode ⚡
Lightning Mode uses a specialized LoRA adapter to dramatically accelerate generation with minimal quality loss.
How It Works
- Automatically loads the Qwen-Image-Lightning LoRA
- Reduces inference steps from 50 to 8 (6.25× fewer steps)
- Optimizes cfg_scale to 1.0 for best speed/quality balance
- 10× faster generation on standard resolutions
- 12× faster on HD resolutions
When to Use Lightning Mode
Ideal for:
- High-volume batch processing
- Draft/preview generation
- Time-sensitive applications
- Cost-sensitive workflows where speed matters
Use Standard Mode for:
- Final production images
- Maximum quality requirements
- Complex scenes needing more refinement
Usage Example
```json
{
  "prompt": "Mountain landscape at sunset",
  "go_fast": true,
  "aspect_ratio": "16:9"
}
```
Performance Comparison (1328×1328):
- Standard (50 steps): ~34s → ~$0.052 per image
- Lightning (8 steps): ~3.5s → ~$0.005 per image (Significantly faster and more cost-effective)
Quality Trade-offs
Lightning Mode maintains good quality for most use cases, but you may notice:
- Slightly less fine detail in complex textures
- Marginally reduced color accuracy in certain scenes
- Faster convergence (good for most prompts, may need iteration for complex concepts)
For critical production work, compare both modes to determine the best fit for your needs.
Best Practices
Cost Optimization
Use Lightning Mode for Cost Savings
- Lightning mode offers the most cost-effective generation for batch processing.
- Standard mode becomes more economical for sessions with 100+ images.
- Use prediction API for better control.
- Plan generation bursts rather than sporadic requests.
Avoid Playground for Testing
- Each manual test = full cold boot cost.
- Use API for development/iteration.
- Reserve Playground for final verification only.
Optimize Parameters
- Start with 30-40 steps (faster, good quality).
- Increase to 50 for production.
- Test with lower steps first.
LoRA Management
- Host LoRAs on reliable CDN or HuggingFace.
- Reuse same URLs to benefit from caching.
- Test LoRA scales between 0.6-0.9.
- Combine up to 2 LoRAs for complex styles.
FAQ
Q: Why is cold boot so long?
A: We load the official model directly from HuggingFace to guarantee you’re using the exact HF weights, not pre-baked versions. This provides transparency and ensures you have the official model. Alternative versions may use pre-baked weights for faster startup.
Q: When should I use this build?
A: Use this build when you need guaranteed official HuggingFace weights, flexible LoRA support, or high-volume batch processing. Use Lightning mode for cost-sensitive workflows and Standard mode for maximum quality.
Q: What’s the key advantage of this build?
A: Guaranteed use of official HuggingFace model weights and full transparency about the model source. If you need to ensure you’re using the exact official model from HF via API, this build provides that guarantee.
Q: Can I use this in Playground?
A: Technically yes, but it can be expensive. Each manual test triggers a full cold boot. We strongly recommend API usage only.
Q: How do I minimize costs?
A: Use Lightning mode for maximum cost-efficiency. For Standard mode, batch 100+ images per session to amortize the cold boot cost. Use the API instead of the Playground and generate in bursts.
Q: What’s the minimum viable usage?
A: Lightning mode is cost-effective at any scale. For Standard mode, batching more images makes it more economical. For single-image tasks where the absolute fastest startup is critical, other pre-warmed models may be more convenient.
Q: Do LoRAs stay loaded between predictions?
A: Yes! Same LoRA URL is cached and reused instantly across requests without re-downloading.
Q: Can I use multiple LoRAs?
A: Yes, up to 2 LoRAs simultaneously with independent scale controls.
Troubleshooting
“Out of Memory” Error:
- Use H100 or A100 80GB (60GB+ required)
- Set `num_outputs` to 1 (not 2)
- Try smaller aspect ratios
Slow Generation:
- Normal timings on H100:
- Standard mode: ~34s per image (50 steps, 1328×1328)
- Lightning mode: ~3.5s per image (8 steps, 1328×1328)
- HD images take longer (see Performance Benchmarks)
- Cold boot adds ~60-90s on first prediction
- With LoRAs: expect additional time for first-time download/loading
- Subsequent LoRA uses are instant (cached)
LoRA Not Loading:
- Verify URL is publicly accessible
- Ensure file is .safetensors format
- Test URL in browser first
- Check archive contains .safetensors file
Cost Optimization:
- Use Lightning mode for significant cost savings.
- Standard mode is most cost-effective for 50+ images per session.
- Batch requests to amortize cold boot cost.
Model Information
- Base Model: Qwen/Qwen-Image
- Parameters: 20 billion
- Architecture: Diffusion transformer
- License: Apache 2.0
- Paper: arXiv:2508.02324
Related Resources
- 🔗 Official Replicate Version (alternative with pre-baked weights)
- 📖 Qwen-Image GitHub
- 📚 Replicate API Documentation
⚠️ Final Reminder: This build provides official HuggingFace weights via API. Use Lightning mode for high-throughput generation, or Standard mode for maximum quality on larger batches. Alternative versions with pre-baked weights may offer a faster cold start and stability: https://replicate.com/qwen/qwen-image