zsxkib / hunyuan-video2video

A video-to-video generation model built on Tencent's HunyuanVideo that transforms source videos into new high-quality videos with realistic motion, guided by text prompts

  • Public
  • 2.3K runs
  • H100
  • Weights
  • Paper
  • License

Input

*file

Input video file.

string

Text prompt describing the desired output video style. Be descriptive.

Default: "high quality nature video of a excited brown bear walking through the grass, masterpiece, best quality"

integer
(minimum: 64, maximum: 2048)

Output video width (divisible by 16 for best performance).

Default: 768

integer
(minimum: 64, maximum: 2048)

Output video height (divisible by 16 for best performance).

Default: 768

boolean

Keep aspect ratio when resizing. If true, dimensions are adjusted proportionally.

Default: true

integer
(minimum: 1, maximum: 150)

Number of sampling (denoising) steps.

Default: 30

number
(minimum: 1, maximum: 20)

Embedded guidance scale. Higher values follow the prompt more strictly.

Default: 6

number
(minimum: 0, maximum: 1)

Denoise strength (0.0 to 1.0). Higher = more deviation from input content.

Default: 0.85

integer
(minimum: 1, maximum: 20)

Flow shift for temporal consistency. Adjust to tweak video smoothness.

Default: 9

integer

Set a seed for reproducibility. Random by default.

Including frame_rate and 8 more...
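
The fields above describe the model's API inputs. Below is a minimal sketch of a call with the Replicate Python client; the exact input key names are not shown on this page, so the keys used here (video, prompt, width, height, keep_aspect_ratio, steps, guidance_scale, denoise_strength, flow_shift, seed) are assumptions inferred from the field descriptions, not confirmed identifiers.

# Hypothetical call sketch: input key names are assumed, not taken from this page.
# Requires `pip install replicate` and REPLICATE_API_TOKEN set in the environment.
import replicate

output = replicate.run(
    "zsxkib/hunyuan-video2video",
    input={
        "video": open("source.mp4", "rb"),   # required input video file
        "prompt": "high quality nature video of a excited brown bear walking through the grass, masterpiece, best quality",
        "width": 768,                        # divisible by 16 for best performance
        "height": 768,
        "keep_aspect_ratio": True,
        "steps": 30,                         # sampling (denoising) steps
        "guidance_scale": 6,                 # higher follows the prompt more strictly
        "denoise_strength": 0.85,            # higher = more deviation from the source
        "flow_shift": 9,                     # temporal-consistency tweak
        "seed": 42,                          # omit for a random seed
    },
)
print(output)  # typically a URL to the generated video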

Output

Run time and cost

This model costs approximately $0.57 to run on Replicate, or about 1 run per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia H100 GPU hardware. Predictions typically complete within 7 minutes. The predict time for this model varies significantly based on the inputs.
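
As a rough sanity check, the quoted price is consistent with the typical completion time, assuming Replicate's listed H100 rate of roughly $0.001525 per second (an assumption; the rate is not stated on this page).

# Back-of-the-envelope check; the per-second H100 rate is an assumption.
H100_USD_PER_SECOND = 0.001525   # assumed Replicate H100 price, not quoted on this page
COST_PER_RUN_USD = 0.57          # figure quoted above

runtime_s = COST_PER_RUN_USD / H100_USD_PER_SECOND
print(f"~{runtime_s / 60:.1f} minutes of H100 time per run")  # ~6.2 minutes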

Readme

HunyuanVideo Video-to-Video Generation Model 🎬

A video-to-video implementation of Tencent’s HunyuanVideo framework, powered by Jukka Seppänen’s (@Kijaidesign) ComfyUI nodes. This model specializes in transforming source videos into new high-quality videos while maintaining temporal consistency and motion quality.

Implementation ✨

This Replicate deployment integrates:

  • Tencent’s HunyuanVideo framework
  • @Kijaidesign’s ComfyUI-HunyuanVideoWrapper
  • cog-comfyui for Replicate deployment

Model Description 🎥

The model leverages HunyuanVideo’s dual-stream architecture and 13 billion parameters to transform input videos into new styles and contexts. Using a spatial-temporally compressed latent space and sophisticated text encoding through large language models, it maintains high fidelity while allowing creative transformations of source material.

Key features:

  • 🎥 High-quality video-to-video transformation
  • 📐 Support for various aspect ratios and resolutions
  • 🎯 Excellent temporal consistency and motion preservation
  • 🎨 Style transfer capabilities while maintaining motion coherence
  • 🔄 Works with diverse source video types

Prediction Examples 💫

The model excels at transformations like:

  • Converting daytime scenes to night
  • Changing weather conditions in landscape videos
  • Transforming art styles while preserving motion
  • Maintaining consistent style across video frames
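
For instance, a day-to-night transformation might look like the sketch below. As with the earlier example, the input key names and values are illustrative assumptions; a denoise strength below the 0.85 default keeps the output closer to the source motion and layout.

# Hypothetical day-to-night transformation; key names are assumed, not confirmed.
import replicate

output = replicate.run(
    "zsxkib/hunyuan-video2video",
    input={
        "video": open("daytime_street.mp4", "rb"),
        "prompt": "the same city street at night, neon signs, rain-slicked asphalt, cinematic lighting",
        "denoise_strength": 0.6,   # below the 0.85 default: preserves more source structure
        "guidance_scale": 6,
        "flow_shift": 9,           # default temporal-consistency setting
    },
)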

Limitations ⚠️

  • Generation time increases with video length and resolution
  • Higher resolutions require more GPU memory
  • Complex transformations may require careful prompt engineering
  • Source video quality impacts output results
  • Memory usage depends on input resolution and frame count

Credits and Citation 📚

This implementation relies on the following key works:

  1. Original HunyuanVideo by Tencent:
@misc{kong2024hunyuanvideo,
      title={HunyuanVideo: A Systematic Framework For Large Video Generative Models}, 
      author={Weijie Kong, et al.},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Special acknowledgment to Jukka Seppänen (@Kijaidesign) for the excellent ComfyUI implementation that makes video-to-video generation possible.


Follow me on Twitter/X