cjwbw / controlvideo

Training-free Controllable Text-to-Video Generation

Run time and cost

Predictions run on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.


Official PyTorch implementation of “ControlVideo: Training-free Controllable Text-to-Video Generation”

ControlVideo adapts ControlNet to video generation without any finetuning, aiming to directly inherit ControlNet's high-quality and consistent generation.


If you make use of our work, please cite our paper.

This repository borrows heavily from Diffusers, ControlNet, Tune-A-Video, and RIFE.

There are also many interesting works on video generation, including Tune-A-Video, Text2Video-Zero, Follow-Your-Pose, and Control-A-Video.